Abstract

The research and development objectives and results obtained over the course of MRC’s Integrated
Sensing and Processing contract are compiled in this document. The effort comprised several
technical areas in radar- and communication-signal processing that could substantially benefit from an
integration of sensors with processing. In particular, we studied problems related to the
statistical and algebraic characterization of sources and sensors, the exploitation of channel and source
statistics for improved communication in harsh environments, the determination of the minimum
required link bandwidth and associated quantization scheme for transmission of local sensor
estimates to other sensors or to a central processor, and problems in the area of decision-directed
sensing and sensor-network management. In this latter technical area, we focused on the
sensor-scheduling problem using partially observable Markov decision processes and on the problem of
combining multiple single-modality target classifiers to form a hyperclassifier whose performance
exceeds that of any of the constituent classifiers while minimizing the amount of data requested
from the array of available sensor modalities.
1  Introduction

This document provides the final report for DARPA/AFRL contract F33615-02-C-1198 “Integrated
Sensing and Processing (ISP).” The structure of the report is as follows. An executive
summary is provided in Section 2, and Section 3 provides links to all the technical reports and
published papers produced under the contract. If this report is being read electronically, these
reports can also be viewed by clicking the colored text representing the report titles. Concluding
remarks are made in Section 4.


2  Executive  Summary 

2.1  Overview  of  Project  Objectives 

The  original  objectives  of  the  MRC-CSU  ISP  program  are  stated  in  the  following  list,  which  is 
taken  from  the  technical  proposal  submitted  to  DARPA. 

1.  Develop  constituent  mathematical  tools  for  ISP.  Here  the  objective  is  to  develop  tools  for 
integrating  sensing  and  processing  over  as  wide  a  range  of  application  areas  as  possible.  In 
particular,  tools  were  to  be  developed  relating  to  the  following  fundamental  signal  processing 
areas: 

(a)  Characterization  of  sources  and  sensors. 

(b)  Exploitation  of  channel  and  target  statistics. 

(c)  Sensor  deployment  and  clustering. 

(d)  Bandwidth  allocation  and  information  quantization. 

(e)  Decision-directed  sensing  and  processing. 

2.  Construct a general ISP framework. Here the idea is to refine the developed tools and
create a framework for developing the interfaces between the tools, which can then be adapted
to a specific problem of interest.

(a)  Refinement  of  developed  mathematical  tools. 

(b)  Integration  of  mathematical  tools  into  a  general  framework. 

(c)  Validate  framework. 

3.  Apply ISP tools and framework to two problems in radar and communications. In this
third major objective, the idea is to apply the developed ISP tools and framework to each of
two major problems of interest to the government: automatic target recognition and
high-speed MIMO communication. The outcome would be a quantitative characterization of the
performance or cost benefits of utilizing the new ISP tools in these familiar settings.

(a)  Radar-based  automatic  target  recognition. 

(b)  Multi-input  multi-output  communication  link. 
After the October 2003 ISP Program Review, DARPA decided not to exercise the FY04 and FY05
funding options for this effort. This led to a serious descoping from our original objectives; the
adjusted program scope is described in the following subsection. The reduced set of objectives was
approved by the AFRL and DARPA program managers.

2.1.1  Program  Objectives  After  Funding  Decrease 

1.  Develop constituent mathematical tools for ISP.

(a)  Characterization  of  sources  and  sensors. 

i.  Beamforming versus diversity combining, connections between radar scattering
functions and time-frequency distribution analysis, connection between
sensor-network control and radar-parameter adaptation [CSU].

(b)  Decision-directed  sensing  and  processing. 

i.  Partially observable Markov decision processes (POMDPs) for control and
management of sensor networks [CSU].

ii.  Binary  hypertrees  for  automatic  target  recognition  [MRC]. 

2.2  Technical  Approach 

The  technical  approach  employed  for  each  of  the  three  technical  objectives  is  described  at  a  high 
level  in  this  section.  The  approaches  employed  for  two  additional  objectives  that  were  pursued 
prior  to  the  descoping  are  also  described.  For  more  detail,  please  consult  the  appropriate  technical 
report  in  Section  3. 

2.2.1  Exploitation  of  Channel  Statistics  (Pre-Descoping) 

We focus on the exploitation of communication-channel statistics to enable radio communication
that is more robust to time-varying channel conditions, such as multipath and cochannel
interference. The operational idea is to extend the notion of rate-adaptive communication links to
modulation-adaptive communication links. In a rate-adaptive link, only the constellation of the
employed digital QAM signal is allowed to vary over time; in a modulation-adaptive link, many more
system parameters are allowed to vary. To develop this idea, we followed an information-theoretic
technical approach. In particular, we developed an abstracted version of the problem and
modeled the communication system and the physical channel as first-order Markov random
processes. This leads to a model of the communication system that is constant over fixed periods of
time, during which it is characterized as a discrete memoryless channel. We developed formulas
for the capacity of such an adaptive system, which build on known formulas for a fixed system in
the face of a time-varying physical channel.
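
As a rough illustration of the block-constant, discrete-memoryless view described above, the following sketch averages per-state capacities of a two-state Markov channel. Everything specific here is an assumption made for illustration and is not taken from the report: the two states, their transition probabilities, the use of binary symmetric channels, and the premise that both ends of the link know the current state. It is not the capacity formula developed under the contract.

```python
import math


def binary_entropy(p):
    """Binary entropy H2(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(crossover):
    """Capacity, in bits per channel use, of a binary symmetric channel."""
    return 1.0 - binary_entropy(crossover)


def stationary_two_state(p_good_to_bad, p_bad_to_good):
    """Stationary distribution (pi_good, pi_bad) of a two-state Markov chain."""
    total = p_good_to_bad + p_bad_to_good
    return p_bad_to_good / total, p_good_to_bad / total


def average_capacity(p_good_to_bad, p_bad_to_good, eps_good, eps_bad):
    """Stationary average of the per-state BSC capacities (assumed model)."""
    pi_good, pi_bad = stationary_two_state(p_good_to_bad, p_bad_to_good)
    return pi_good * bsc_capacity(eps_good) + pi_bad * bsc_capacity(eps_bad)


if __name__ == "__main__":
    # Hypothetical parameters: a mostly good channel that occasionally fades.
    print(average_capacity(0.1, 0.3, eps_good=0.01, eps_bad=0.2))
```

The sketch only shows how a Markov state model reduces to a weighted sum of memoryless-channel capacities under these assumptions; the adaptive system studied in the report varies many more parameters than a single crossover probability.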

2.2.2  Bandwidth  Allocation  and  Quantization  (Pre-Descoping) 

The technical approach here is to formulate an optimization problem in which the sensors in a
network can quantize their raw information and transmit it, or they can transform the information,
then quantize and transmit. The idea is to determine whether any particular ordering of operations
is required for best performance. The results of this research can then be used in the design of
quantization and transmission elements for wireless sensor networks. In particular, it may be quite
beneficial for each sensor to quantize and transmit to a more sophisticated central processor rather
than outfit the sensors themselves with sophisticated estimators and quantizers.

2.2.3  Characterization  of  Sources  and  Sensors 

The  technical  approach  for  characterization  of  sources  and  sensors  is  to  attempt  to  unify  aspects  of 
radar  signal  processing  with  wireless  communication  signal  processing.  In  particular,  an  attempt  is 
made  to  unify  the  ideas  of  time-frequency  distributions  and  radar  scattering  functions.  The  benefit 
of  this  approach  is  that  if  a  connection  is  forged,  then  the  large  body  of  work  on  time-frequency 
distributions  may  be  brought  to  bear  on  the  scattering-function  estimation  (channel  estimation) 
problem. 

2.2.4  POMDPs  for  Sensor  Network  Control  and  Management 

The  technical  approach  employed  for  sensor  network  control  and  management  is  first  to  focus 
on  the  sensor  scheduling  subproblem,  and  then  to  apply  partially  observable  Markov  decision 
processes  to  this  subproblem.  The  scheduling  problem  is  a  serious  one,  involving  as  it  does  the 
ultimate  tracking  performance  of  the  network  as  well  as  the  lifespan  and  long-term  utility  of  the 
network.  Careful  scheduling  of  the  on  and  off  states  of  the  sensors  can  substantially  increase 
the  network  lifetime  and  permit  the  network  to  be  maximally  useful  (though  degraded,  perhaps) 
throughout  its  lifespan.  The  POMDP  formulation  of  the  problem  uses  particle  filtering  to  estimate 
prior  probabilities  of  the  system  state  instead  of  assuming  any  particular  probabilistic  model,  and 
this  makes  the  approach  much  more  realistic.  The  performance  of  the  scheduler  is  compared  to  the 
closest-point-of-approach  algorithm,  a  simple  and  popular  alternative,  which  cannot  make  use  of 
crucial  sensor  attributes  such  as  individual  error  statistics,  current  power  level  remaining,  cost  of 
use,  etc. 
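
The particle-filtering step mentioned above can be sketched as a bootstrap filter that carries the belief state as a set of samples. The sketch below is a minimal illustration under assumptions that do not come from the report: a scalar target state, a random-walk motion model, and a Gaussian sensor likelihood. All function names are hypothetical.

```python
import math
import random


def predict(particles, process_std, rng):
    """Propagate each particle through an assumed random-walk motion model."""
    return [x + rng.gauss(0.0, process_std) for x in particles]


def likelihood(z, x, sensor_std):
    """Gaussian measurement likelihood p(z | x), up to a constant factor."""
    return math.exp(-0.5 * ((z - x) / sensor_std) ** 2)


def update(particles, z, sensor_std, rng):
    """Weight particles by the measurement, then resample (bootstrap filter)."""
    weights = [likelihood(z, x, sensor_std) for x in particles]
    if sum(weights) == 0.0:
        # Measurement is far from every particle: fall back to the prior.
        return list(particles)
    return rng.choices(particles, weights=weights, k=len(particles))


def belief_mean(particles):
    """Point estimate of the state from the sampled belief."""
    return sum(particles) / len(particles)


def step(particles, z, process_std, sensor_std, rng):
    """One predict-update cycle of the sampled belief state."""
    return update(predict(particles, process_std, rng), z, sensor_std, rng)
```

A scheduler built on such a belief could pass a different `sensor_std` for each candidate sensor, which is one way individual error statistics might enter the decision; the report's actual scheduler and its Q-value approximation are not reproduced here.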

2.2.5  Binary  Hypertree  Classifiers  for  ATR 

The technical approach here is to build on previously developed tree-based classifiers for ATR.
This classifier approach employs the local discriminant basis (LDB) to automatically determine
the best wavelet representation of the set of class inputs for the purpose of classification. This
should be contrasted with the standard wavelet approach of finding the best wavelet representation
(basis) for a set of inputs for the purpose of compression. To integrate sensing and processing, we
envision an ATR system that has at its disposal several sensing modalities (e.g., different camera
types or a set of distinct radar waveforms and bandwidths). A tree-based recognizer is constructed
for each of the modalities and as many measurement functions (e.g., wavelet types) as desired,
creating a family of tree-based classifiers. The hypertree idea is to link these trees together such
that if an ambiguous node is reached in one tree, the node points to the tree having the best chance
of removing the ambiguity. This tree is “jumped to” and if a new sensing is required, the data
is obtained. In this way, the classification is performed in a sequential manner, and the best data
subspace for classification is automatically determined by adaptively responding to the data.
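
The jump between linked trees can be pictured with a toy data structure. The sketch below is purely illustrative: the node classes, the threshold tests, the modality names, and the class labels are all hypothetical and stand in for the LDB-based trees of the report, which are not reproduced here. The point it shows is that data for a modality is requested only when a node in that modality's tree is actually evaluated.

```python
class Leaf:
    """Terminal node carrying a class label."""
    def __init__(self, label):
        self.label = label


class Split:
    """Binary threshold test on one feature of the current modality."""
    def __init__(self, feature, threshold, low, high):
        self.feature, self.threshold = feature, threshold
        self.low, self.high = low, high


class Jump:
    """Ambiguous node: hand the decision to the tree of another modality."""
    def __init__(self, modality):
        self.modality = modality


def classify(trees, start, acquire):
    """Walk the linked trees; acquire(modality) returns features on demand.

    Returns (label, modalities sensed, in order). Assumes jumps are acyclic.
    """
    sensed = {}
    modality, node = start, trees[start]
    while True:
        if isinstance(node, Leaf):
            return node.label, list(sensed)
        if isinstance(node, Jump):
            modality, node = node.modality, trees[node.modality]
            continue
        if modality not in sensed:  # request data only when it is needed
            sensed[modality] = acquire(modality)
        value = sensed[modality][node.feature]
        node = node.low if value < node.threshold else node.high


# Hypothetical two-modality example: the radar tree cannot separate two
# classes on its high branch, so that node jumps to the camera tree.
EXAMPLE_TREES = {
    "radar": Split(0, 0.5, Leaf("truck"), Jump("camera")),
    "camera": Split(1, 0.0, Leaf("tank"), Leaf("car")),
}
```

With this structure, an input that is resolved inside the radar tree never triggers a camera measurement, which mirrors the goal of minimizing the data requested from the available modalities.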
2.3  Programmatic  Approach

The programmatic approach involved the use of technical resources at MRC, Colorado State
University (CSU), and a consultant. Funding was split nearly evenly between MRC and CSU. MRC
was the prime contractor and CSU was the single subcontractor used for ISP. The single consultant
was John Gubner.

The  AFRL  was  MRC’s  immediate  customer.  The  AFRL  provided  program  management  for 
several  of  the  DARPA  ISP  awardees. 

2.4  Accomplishments 

The  accomplishments  of  the  contract  involved  technical  progress  on  the  three  objectives  involved 
in  the  descoped  effort  as  well  as  progress  on  two  additional  objectives  prior  to  the  descoping.  These 
are  briefly  described  here  at  a  high  level.  The  accomplishments  are  described  in  more  detail  in  the 
technical  reports  of  Section  3. 

1.  Exploitation of channel and target statistics (pre-descoping): ISP for generic
communication links. The capacity formula for a general time-varying digital communication system
facing a time-varying physical channel (both modeled as Markov processes) was obtained.
The ultimate capacity of such a system requires that the formula be evaluated for an infinite
number of channel uses. We implemented the general formula in MATLAB and found that
evaluating it was costly for even ten channel uses when the parameters of the Markov
processes and the particular digital communication system parameters (e.g., alphabet size) were
realistic. Nevertheless, we were able to show that the capacity of an ISP-enabled
adaptive-modulation system can be orders of magnitude larger than that for a static system facing a
time-varying channel. A complete technical report was prepared and submitted. It can be
found in Section 3.2.

2.  Bandwidth allocation and quantization (pre-descoping): canonical coordinates for
transform coding. Assuming the additive white noise model for quantization, we have proved
that  the  correct  coordinate  systems  for  quantization  are  the  systems  of  half  and  full  canonical 
coordinates.  Half  canonical  coordinates  minimize  the  trace  of  the  error  covariance  matrix, 
while  full  canonical  coordinates  minimize  the  determinant.  Others  have  previously  proved 
that  canonical  coordinates  are  optimum  for  rank  reduction  as  well.  Together  with  our  results, 
this  means  that  we  can  first  choose  a  coordinate  system  and  then  decide  how  many  bits  to 
spend  on  the  components.  See  Section  3.3. 

3.  Characterization of sources and sensors: time-frequency distributions and scattering
functions. We have studied the estimation of time-frequency distributions (TFDs) and
estimation of wireless scattering functions (SFs). We have shown that the most general quadratic
estimator of each that is delay- and modulation-invariant may be written as a convolution of
a Rihaczek TF density with a TF kernel function. The representation illustrates a
fundamental difference in the design aspects of the two problems. In TFD estimation, the Rihaczek
TF density is the raw Rihaczek function for the time series, and the kernel is designed to
convolve in time and frequency to perform a smoothing role. In SF estimation, the kernel is
the true SF and the Rihaczek TF density is that for a transmitter signal designed to
deconvolve in time and frequency to perform an inversion role. In each case, the obtained Fourier
transform identities in a four-corners diagram allow for kernel or Rihaczek design in the
transformed space of ambiguity functions. See Section 3.1.

4.  Decision-directed sensing and processing: POMDPs for sensor network control. We
have developed a sensor-scheduling algorithm based on POMDPs. Instead of relying on
analytic expressions for belief states, we use a Monte Carlo approach that combines particle
filtering for non-Gaussian nonlinear belief-state estimation with a Q-value approximation
method that allows long-term look-ahead and thereby avoids producing greedy algorithms.
Our algorithm is shown to outperform the closest-point-of-approach algorithm using
simulated data. This indicates that the use of POMDPs may lead to much more efficient sensor
network management, and therefore much longer-lived sensor networks. See Section 3.4.1.

5.  Decision-directed  sensing  and  processing:  binary  hypertrees  for  ATR.  In  the  first  part 
of  this  work,  we  established  the  mathematical  foundations  for  representing  and  analyzing 
binary-tree  classifiers  that  are  based  on  exploitation  of  the  LDB.  We  defined  three  basic 
classifier  types: 

(a)  The  binary  tree  classifier  (BTC).  This  classifier  is  associated  with  a  single  modality  and 
a  single  measurement  function  (e.g.  wavelet  type). 

(b)  The binary hypertree classifier (BHC). This classifier is composed of a linked set of
BTCs. As such, it encompasses multiple modalities and/or multiple measurement
functions.

(c)  The binary supertree classifier (BSC). This classifier represents optimal (fusion)
performance. It is a BTC but its input is the concatenation (or tiling for two-dimensional
inputs) of all available modalities.

The  mathematical  work  established  the  basic  performance  ordering  of  {BTC}  <  BHC  < 
BSC.  The  mathematics  of  this  work  were  submitted  in  a  technical  report  and  can  be  found 
in  Section  3. 

In the second part of the hypertree work, we implemented the three basic classifiers in
MATLAB to provide a proof of concept and to evaluate performance claims made in the
analytical work. We studied the performance of the classifiers using one- and two-dimensional
synthetic problems and by applying them to several collected public data sets. A key
accomplishment is the construction of an algorithm to automatically and jointly determine a good
(BTC) tree topology and classifier parameters for an arbitrary classification problem. We
found that the BTC was able to embody the essential class ambiguity structure of the
problems under study.

For the one-dimensional problem, which involved the sixteen maximal-length shift-register
(MLSR) sequences for shift-register length eight, we found that the BTCs, BHC, and BSC
all delivered good-to-excellent performance and that performance was dependent on the
particular wavelet type.

For  the  two-dimensional  problem,  which  involved  four  modalities  and  eight  classes,  we 
found  that  the  predicted  performance  ordering  held  in  all  cases  and  that  performance  was 
not  particularly  sensitive  to  the  wavelet  type.  This  is  due  to  the  presence  of  severe  class 
ambiguities  (large  equivalence  classes)  for  each  of  the  modalities.  See  Section  3.4.2. 
3  Technical  Reports  and  Publications

When  viewing  this  document  electronically,  click  on  the  title  of  the  desired  technical  document  to 
view  it  in  your  default  PDF-file  viewer.  The  report  PDF  files  are  contained  in  subdirectories  of  the 
directory  containing  this  document.  If  you  are  reading  a  printed  form  of  the  document,  each  report 
is  an  appendix,  and  the  page  number  for  the  appendix  is  provided  next  to  the  document  title  in  the 
list  below. 


3.1  Algebraic  Characterization  of  Sources  and  Sensors 

1.  [ ].  See Appendix A, page 11.

2.  [2].  See Appendix B, page 16.

3.  .  See  Appendix  C,  page  17. 


3.2  Exploitation  of  the  Statistics  of  Propagation  Channels  and  Targets 


1.  [ ].  See Appendix D, page 40.


3.3  Bandwidth  Allocation  and  Quantizing 


1.  [5].  See Appendix E, page 77.

2.  [ ].  See Appendix F, page 96.


3.4  Decision-Directed  Sensing  and  Processing 

3.4.1  Sensor  Management  and  Control 

1.  [9].  This work was also published, with small differences, in [ ] and [8].  See Appendix G, page 100.

3.4.2  Hypertree  Classifiers 

1.  [10].  See Appendix H, page 114.

2.  [ ].  See Appendix I, page 160.
3.5  Applications  of  the  Mathematical  Methodology

Due  to  the  funding  constraints  imposed  on  MRC  during  the  performance  of  this  contract,  the 
developed  algorithmic  technology  was  not  applied  to  any  real-world  defense-related  collected  data 
sets. 


4  Conclusions 

We  make  the  following  conclusions  based  on  our  ISP  work  under  this  contract. 

1.  For general communication links, a large increase in capacity can be obtained by
employing the ISP-inspired notion of modulation adaptation. This notion generalizes rate-adaptive
signaling to modulation-adaptive signaling by allowing multiple aspects of the transmitted
waveform to be adapted, such as the transmission band, modulation type, coding, and
modulation rates.

2.  For sensor networks, partially observable Markov decision processes (POMDPs) appear to
be very useful for the sensor-scheduling problem. In particular, this approach naturally
allows various real-world sensor constraints, such as battery life, current power level, cost of
operation, etc., to be taken into account in a dynamic fashion. Such an approach leads to
more efficient use of the total network resources, and can thereby extend the useful lifespan
of a sensor network.

3.  For automatic target recognition, we have found that the notion of hypertree classification has
significant  merit.  In  particular,  the  developed  method  automatically  finds  the  best  sequential 
classifier  that  can  be  built  from  the  collection  of  constituent  classifiers,  any  one  of  which  can 
be  arbitrarily  bad.  A  significant  outcome  of  this  work  is  the  development  of  an  automated 
training  algorithm  for  jointly  determining  the  tree  topology,  class  splits,  and  feature  vectors 
for  an  arbitrary  classification  problem. 
ABSTRACT

In this paper we study two problems: estimation of time-frequency
distributions (TFDs) and estimation of wireless
or radar scattering functions (SFs). We show that the most
general quadratic, delay- and modulation-invariant estimator
of each may be written as a convolution of a Rihaczek
TF density with a TF kernel. This representation complements
other equivalent representations and establishes a fundamental
connection between the analysis of the two problems.
However, the representation also illustrates a fundamental
difference in design for the two problems. For TFD
estimation, the Rihaczek TF density is the raw Rihaczek for
the time series and the kernel is designed to convolve in time
and frequency for smoothing. For SF estimation, the kernel
is the true SF and the Rihaczek TF density is that for a transmitter
signal designed to deconvolve in time and frequency
for inversion. In each case, the Fourier transform identities in
a four-corners diagram allow for kernel or Rihaczek design
in the transformed space of ambiguity functions. Design in
this space then produces spectrograms and interferograms
for TFD estimation, and ideal transmitter signals for SF
estimation.

1.  INTRODUCTION 

In this paper we offer yet another representation for the Cohen
class [1] of quadratic, delay- and modulation-invariant
time-frequency distributions (TFDs), this one based on a
convolution of the Rihaczek TF density with a TF kernel.
This representation is used to produce spectrograms and
interferograms, which are practical estimators of TFDs.

We then ask whether this representation has any relevance
to the estimation of scattering functions (SFs) for
wireless and radar channels. The answer is yes. In fact,
by following Gaarder’s original arguments [2], we find that
the most general quadratic, delay- and modulation-invariant
estimator of the SF has the same representation, namely a
convolution of a Rihaczek TF density with a TF kernel. But
here the similarities end, for there is a key difference in design
philosophy for TFD estimation and SF estimation.

In TFD estimation, the Rihaczek density is the raw Rihaczek
for the time series, and the TF kernel is a free design
variable. This kernel is designed to smooth in time and frequency.
In SF estimation, the TF kernel is the true SF and
the Rihaczek TF density is a free variable. This Rihaczek is
actually the Rihaczek for the transmitted signal, which is designed
to invert in time and frequency for the true SF. Thus,
for TFD estimation the problem is one of design for convolution,
whereas for SF estimation the problem is one of design
for deconvolution. In each case, the Fourier transform identities
in a four-corners diagram allow for kernel or Rihaczek
design in the transformed space of ambiguity functions. Design
in this space then produces spectrograms and interferograms
for TFD estimation, and ideal transmitter signals for
SF estimation.

The equations we derive are

$$P_e(t,f) = \iint X(f')\,e^{j2\pi f't'}\,x^*(t')\,e(t-t',\,f-f')\,df'\,dt'$$

for TFD estimation, and

$$\hat{P}_a(\tau,\nu) = \iint X(\nu')\,e^{j2\pi\nu'\tau'}\,x^*(\tau')\,P_a(\tau-\tau',\,\nu-\nu')\,d\nu'\,d\tau'$$

for SF estimation. Thus, for TFD estimation the problem
is to design the kernel $e(t,f)$ so that the raw Rihaczek
$X(f)e^{j2\pi ft}x^*(t)$ is smoothed, whereas for SF estimation the
problem is to design the transmitted signal $x(t)$ so that its
raw Rihaczek will invert for the unknown scattering function
$P_a(\tau,\nu)$ (also called the channel ambiguity function). Thus,
while the design objectives are different, the defining analysis
equations are identical. The transform duals of these
equations are

$$\hat{\Gamma}_{xx}(\Delta f,\Delta t) = w(\Delta f,\Delta t)\,\Gamma_{xx}(\Delta f,\Delta t)$$

and

$$E\{\Gamma_{yy}(\Delta f,\Delta t)\} = R_H(\Delta f,\Delta t)\,\Gamma_{xx}(\Delta f,\Delta t),$$

where $\Gamma_{xx}$, $w$, and $R_H$ are the 2D Fourier transforms of
$X(f)e^{j2\pi ft}x^*(t)$, $e$, and $P_a$, respectively. For TFD
estimation, $w(\Delta f,\Delta t)$ is designed, and for SF estimation,
$\Gamma_{xx}(\Delta f,\Delta t)$ is designed.

2.  RIHACZEK  FOUR-CORNERS  DIAGRAM 

A four-corners diagram can be used to illustrate relationships
between time-frequency distributions, time-varying system
representations, as well as related correlations and convolutions.
Consider the four-corners diagram in Figure 1. We begin
in the East with the Rihaczek [3] complex cross-energy
density,

$$\gamma_{f,g}(t,f) = F(f)\,e^{j2\pi ft}\,g^*(t). \tag{1}$$
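
Figure 1: Rihaczek four-corners diagram.


With $\mathcal{F}^{-1}_{f\to\tau}$ denoting the inverse Fourier transform operator
from $f$ to $\tau$, we may move North to produce

$$\psi_{f,g}(t,\tau) = \mathcal{F}^{-1}_{f\to\tau}\,\gamma_{f,g}(t,f) = f(t+\tau)\,g^*(t). \tag{2}$$

Moving from East to South,

$$\Psi_{f,g}(\nu,f) = \mathcal{F}_{t\to\nu}\,\gamma_{f,g}(t,f) = F(f)\,G^*(f-\nu), \tag{3}$$

and from North to West to find the cross-ambiguity function
for $f$ and $g$,

$$\Gamma_{f,g}(\nu,\tau) = \mathcal{F}_{t\to\nu}\,\psi_{f,g}(t,\tau) = \int_{-\infty}^{\infty} f(t+\tau)\,g^*(t)\,e^{-j2\pi\nu t}\,dt. \tag{4}$$

The move from South to West yields an equivalent expression
for the cross-ambiguity function:

$$\Gamma_{f,g}(\nu,\tau) = \mathcal{F}^{-1}_{f\to\tau}\,\Psi_{f,g}(\nu,f) = \int_{-\infty}^{\infty} F(f)\,G^*(f-\nu)\,e^{j2\pi f\tau}\,df. \tag{5}$$

Defining the modulation and time-shift operators $M_{f_0}$
and $T_{t_0}$ as $M_{f_0}g(t) = g(t)e^{j2\pi f_0t}$ and $T_{t_0}g(t) = g(t-t_0)$, we
easily find covariant connections

$$\Gamma_{M_{f_0}f,\,g}(\nu,\tau) = e^{j2\pi f_0\tau}\,\Gamma_{f,g}(\nu-f_0,\tau),$$

$$\Gamma_{f,\,M_{f_0}g}(\nu,\tau) = \Gamma_{f,g}(\nu+f_0,\tau),$$

$$\Gamma_{f,\,T_{t_0}g}(\nu,\tau) = e^{-j2\pi\nu t_0}\,\Gamma_{f,g}(\nu,\tau+t_0),$$

$$\Gamma_{M_{f_a}f,\,M_{f_b}g}(\nu,\tau) = e^{j2\pi f_a\tau}\,\Gamma_{f,g}(\nu+f_b-f_a,\tau),$$

$$\text{and}\quad \Gamma_{T_{t_a}f,\,T_{t_b}g}(\nu,\tau) = e^{-j2\pi\nu t_b}\,\Gamma_{f,g}(\nu,\tau+t_b-t_a).$$

The Fourier transform preserves inner products. Since
$\Gamma_{f,g}(\nu,\tau) = \mathcal{F}_{t\to\nu}\mathcal{F}^{-1}_{f\to\tau}\,\gamma_{f,g}(t,f)$, we find
$\langle\gamma_{f,g},\gamma_{y,x}\rangle = \langle\Gamma_{f,g},\Gamma_{y,x}\rangle$, where

$$\langle\gamma_{f,g},\gamma_{y,x}\rangle = \iint \gamma_{f,g}(t,f)\,\gamma^*_{y,x}(t,f)\,dt\,df = \langle f,y\rangle\langle g,x\rangle^*. \tag{6}$$

This result is known as Moyal’s formula [4]:

$$\langle\Gamma_{f,g},\Gamma_{y,x}\rangle = \langle\gamma_{f,g},\gamma_{y,x}\rangle = \langle f,y\rangle\langle g,x\rangle^*. \tag{7}$$

3.  THE  SUSSMAN  FOUR-CORNERS  DIAGRAM

Sussman [5] was apparently the first to state a particularly
useful identity for ambiguity functions. Consider the Sussman
four-corners diagram in Figure 2.


Figure 2: Sussman four-corners diagram.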


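Begin with the conjugate product in the West

$$w(\nu,\tau) = \Gamma_{f,g}(\nu,\tau)\,\Gamma^*_{y,x}(\nu,\tau). \tag{8}$$

Since the North function is $n(t,\tau) = \mathcal{F}^{-1}_{\nu\to t}\,w(\nu,\tau)$, we know
that $n(t,\tau)$ is the following correlation in $t$ and conjugate
product in $\tau$ (refer also to Figure 1):

$$\begin{aligned}
n(t,\tau) &= \int_{-\infty}^{\infty} \psi_{f,g}(t'+t,\tau)\,\psi^*_{y,x}(t',\tau)\,dt' \\
&= \int_{-\infty}^{\infty} f(t'+t+\tau)\,g^*(t'+t)\,y^*(t'+\tau)\,x(t')\,dt' \\
&= \int_{-\infty}^{\infty} \psi_{f,y}(t'+\tau,t)\,\psi^*_{g,x}(t',t)\,dt'.
\end{aligned}$$

Note that to obtain the last line from the first above, exchange
$t$ with $\tau$ and $g$ with $y$. Similarly, the remaining corners of
Figure 2 can be obtained. The relationship between the West
and the East of Figure 2 is known as the Sussman identity:

$$\mathcal{F}^{-1}_{\nu\to t}\mathcal{F}_{\tau\to f}\left\{\Gamma_{f,g}(\nu,\tau)\,\Gamma^*_{y,x}(\nu,\tau)\right\} = \Gamma_{f,y}(f,t)\,\Gamma^*_{g,x}(f,t). \tag{9}$$

It is worth noting that $\Gamma_{f,g}(\nu,\tau)$ is an ambiguity function of
the local delay and Doppler variables $\tau$ and $\nu$, and $\Gamma_{f,y}(f,t)$
is an ambiguity function of the global frequency and time
variables $f$ and $t$. It is also worth noting that from (9)

$$(\Gamma_{f,y}\Gamma^*_{g,x})(0,0) = \iint (\Gamma_{f,g}\Gamma^*_{y,x})(\nu,\tau)\,d\nu\,d\tau;$$

i.e.,

$$(\Gamma_{f,y}\Gamma^*_{g,x})(0,0) = \langle f,y\rangle\langle g,x\rangle^* = \langle\Gamma_{f,g},\Gamma_{y,x}\rangle,$$

so that Moyal’s formula can be viewed as a special case of
the Sussman identity.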
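
Moyal's formula can be checked numerically on short sequences. The sketch below uses a cyclic, discrete analogue of the cross-ambiguity function in (4); the cyclic indexing, the unnormalized discrete sums, and the resulting factor of N on the right-hand side are assumptions of this discretization, not statements from the paper.

```python
import cmath
import random


def inner(a, b):
    """Inner product <a, b> = sum_n a[n] * conj(b[n])."""
    return sum(x * y.conjugate() for x, y in zip(a, b))


def cross_ambiguity(f, g):
    """Cyclic analogue of (4): G[nu][tau] = sum_n f[n+tau] g*[n] e^{-j2 pi nu n/N}."""
    size = len(f)
    return [[sum(f[(n + tau) % size] * g[n].conjugate()
                 * cmath.exp(-2j * cmath.pi * nu * n / size)
                 for n in range(size))
             for tau in range(size)]
            for nu in range(size)]


def moyal_sides(f, g, y, x):
    """Return (<G_fg, G_yx>, N <f, y> <g, x>*) for the discrete analogue."""
    size = len(f)
    gam_fg = cross_ambiguity(f, g)
    gam_yx = cross_ambiguity(y, x)
    lhs = sum(gam_fg[nu][tau] * gam_yx[nu][tau].conjugate()
              for nu in range(size) for tau in range(size))
    rhs = size * inner(f, y) * inner(g, x).conjugate()
    return lhs, rhs


def random_signal(size, rng):
    """Complex Gaussian test sequence."""
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(size)]
```

Calling `moyal_sides` on four random sequences of equal length should return two complex numbers that agree to within floating-point error.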
4.  TIME-FREQUENCY  ESTIMATION
FOUR-CORNERS  DIAGRAM

We consider the two-channel Rihaczek-based TF (time-frequency)
estimate

$$P_e(t,f) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e(t-t',\,f-f')\,\gamma_{f,g}(t',f')\,dt'\,df' \tag{10}$$

in the East of Figure 3, where $\gamma_{f,g}(t,f)$ is the Rihaczek TF
distribution of (1). In Figure 3, $*_i$ denotes convolution with
respect to the $i$th variable, and $**$ denotes convolution with
respect to both variables.

With the aid of Figure 1 and well known properties of
Fourier transforms, the remaining corners are easily verified.
We may interpret the TF estimate in the West as a window
$w(\nu,\tau)$ applied to the signal cross-ambiguity function
$\Gamma_{f,g}(\nu,\tau)$.


Figure 3: Rihaczek-based four-corners diagram for TF estimation.


In [6] a discrete-time version of

$$P(t,f) = \iint f(t+t_1)\,g^*(t+t_2)\,Q(t_1,t_2)\,e^{-j2\pi f(t_1-t_2)}\,dt_1\,dt_2 \tag{11}$$

is considered and shown to be another representation for the
most general member of Cohen’s class [1] of TF distributions,
which are quadratic, time- and frequency-translation
invariant. Making the change of variables $t_1 = \tau - t'$ and
$t_2 = -t'$ in (11) we obtain

$$\begin{aligned}
P(t,f) &= \iint Q(\tau-t',-t')\,\psi_{f,g}(t-t',\tau)\,e^{-j2\pi f\tau}\,dt'\,d\tau \\
&= \mathcal{F}_{\tau\to f}\,(n *_1 \psi_{f,g})(t,\tau) = (e ** \gamma_{f,g})(t,f) = P_e(t,f),
\end{aligned}$$

where we have used $n(t,\tau) = Q(\tau-t,-t)$, so that $Q(t_1,t_2) =
n(-t_2,\,t_1-t_2)$. It follows that the most general form of TF
estimate that is quadratic and $(t,f)$ translation invariant can
be expressed in the form of $P_e(t,f)$ in (10).

Using Figure 3, we can write

$$P_e(t,f) = \iint \Psi_{f,g}(\nu,\,f-f')\,s(\nu,f')\,e^{j2\pi\nu t}\,df'\,d\nu.$$

Letting $\nu = f_2 - f_1$ and $f' = f_1$, we obtain

$$P_e(t,f) = \iint F(f-f_1)\,G^*(f-f_2)\,Q(f_1,f_2)\,e^{j2\pi(f_2-f_1)t}\,df_1\,df_2, \tag{12}$$

where we have used $Q(f_1,f_2) = \mathcal{F}_{t_1\to f_1}\mathcal{F}^{-1}_{t_2\to f_2}\,Q(t_1,t_2) =
\mathcal{F}^{-1}_{t_2\to f_2}\{e(-t_2,f_1)\,e^{-j2\pi f_1t_2}\} = s(f_2-f_1,\,f_1)$, or $s(\nu,f) =
Q(f,\,\nu+f)$. This is a frequency-frequency smoothed version
of (10), and the dual of (11). Thus, (10), (11), and (12)
are $t$-$f$, $t$-$t$, and $f$-$f$ representations for the Cohen class.

5.  INTERFEROGRAMS  AND  SPECTROGRAMS 

In Figure 3, if the West window $w(\nu,\tau) = \Gamma^*_{v_1 v_2}(\nu,\tau)$ is itself an ambiguity function involving two one-dimensional window functions $v_1(\cdot)$ and $v_2(\cdot)$, then using (9) the TF estimate becomes

$$P_e(t,f) = \Gamma_{f v_1}(f,t)\,\Gamma^*_{g v_2}(f,t). \qquad (13)$$

If $\Gamma_{v_1 v_2}(\nu,\tau)$ depends on some parameter $\Theta$, say $\Gamma_{v_1 v_2}(\nu,\tau) = \Gamma_{v_1 v_2}(\nu,\tau;\Theta)$, then we may consider a West window as a linear combination of the form

$$w(\nu,\tau) = \int \Gamma^*_{v_1 v_2}(\nu,\tau;\Theta)\,W(\Theta)\,d\Theta \qquad (14)$$

for a continuous parameter space, or

$$w(\nu,\tau) = \sum_i \Gamma^*_{v_1 v_2}(\nu,\tau;\Theta_i)\,W(\Theta_i) \qquad (15)$$

for a discrete parameter space, where $W(\Theta)$ is a weighting function.

If $v_1(t) = v_2(t) = M_{-f_0}v(t)$ and $W(\Theta) = V_0(f_0)$, then

$$w(\nu,\tau) = \int_{-\infty}^{\infty} \Gamma^*_{vv}(\nu,\tau)\,V_0(f_0)\,e^{j2\pi f_0\tau}\,df_0,$$

corresponding in the North to $n(t,\tau) = \psi^*_{vv}(-t,\tau)\,v_0(\tau)$, or $Q(t_1,t_2) = v^*(t_1)\,v_0(t_1-t_2)\,v(t_2)$, which is a dTd (diagonal-Toeplitz-diagonal) factorization of $Q(t_1,t_2)$. The resulting TF estimate is the weighted frequency-averaged spectrogram

$$P_e(t,f) = \int \Gamma_{fv}(f-f_0,t)\,\Gamma^*_{gv}(f-f_0,t)\,V_0(f_0)\,df_0.$$

If $v_1(t) = v_2(t) = T_{-t_0}v(t)$ and $W(\Theta) = v_0(t_0)$, then

$$w(\nu,\tau) = \int_{-\infty}^{\infty} \Gamma^*_{vv}(\nu,\tau)\,v_0(t_0)\,e^{-j2\pi\nu t_0}\,dt_0,$$

corresponding in the South to $s(\nu,f) = \Psi^*_{vv}(\nu,-f)\,V_0(\nu)$, or $Q(f_1,f_2) = V^*(-f_1)\,V_0(f_2-f_1)\,V(-f_2)$, which is a dTd factorization of $Q(f_1,f_2)$. The resulting TF estimate is the weighted time-averaged spectrogram

$$P_e(t,f) = \int \Gamma_{fv}(f,t-t_0)\,\Gamma^*_{gv}(f,t-t_0)\,v_0(t_0)\,dt_0.$$

If $v_1(t) = M_{f_0/2}v(t)$, $v_2(t) = M_{-f_0/2}v(t)$, and $W(\Theta) = V_0(f_0)$, then

$$w(\nu,\tau) = \int_{-\infty}^{\infty} \Gamma^*_{vv}(\nu-f_0,\tau)\,V_0(f_0)\,e^{-j2\pi\frac{f_0}{2}\tau}\,df_0,$$

corresponding in the North to $n(t,\tau) = \psi^*_{vv}(-t,\tau)\,v_0(t-\frac{\tau}{2})$, or $Q(t_1,t_2) = v^*(t_1)\,v_0(-\frac{t_1+t_2}{2})\,v(t_2)$, which is a dHd (diagonal-Hankel-diagonal) factorization of $Q(t_1,t_2)$. The resulting TF estimate is the weighted frequency-averaged interferogram

$$P_e(t,f) = \int \Gamma_{fv}(f+\tfrac{f_0}{2},t)\,\Gamma^*_{gv}(f-\tfrac{f_0}{2},t)\,V_0(f_0)\,df_0.$$

If $v_1(t) = T_{t_0/2}v(t)$, $v_2(t) = T_{-t_0/2}v(t)$, and $W(\Theta) = v_0(t_0)$, then

$$w(\nu,\tau) = \int_{-\infty}^{\infty} \Gamma^*_{vv}(\nu,\tau-t_0)\,v_0(t_0)\,e^{-j2\pi\nu\frac{t_0}{2}}\,dt_0,$$

corresponding in the South to $s(\nu,f) = \Psi^*_{vv}(\nu,-f)\,V_0(f+\frac{\nu}{2})$, or $Q(f_1,f_2) = V^*(-f_1)\,V_0(\frac{f_1+f_2}{2})\,V(-f_2)$, which is a dHd factorization of $Q(f_1,f_2)$. The resulting TF estimate is the weighted time-averaged Wigner-Ville distribution

$$P_e(t,f) = \int \Gamma_{fv}(f,t+\tfrac{t_0}{2})\,\Gamma^*_{gv}(f,t-\tfrac{t_0}{2})\,e^{-j2\pi f t_0}\,v_0(t_0)\,dt_0.$$

6. TIME-VARYING LINEAR SYSTEMS
FOUR-CORNERS  DIAGRAM 

Consider a linear time-varying system with input delay-spread function $h(t,\tau)$ (the North in Figure 4), input $x(t)$ and output $y(t)$. The input-output relationship [7] is

$$y(t) = \int_{-\infty}^{\infty} h(t,\tau)\,x(t-\tau)\,d\tau \qquad (16)$$

in terms of the input delay-spread function, and

$$y(t) = \int_{-\infty}^{\infty} H(t,f)\,X(f)\,e^{j2\pi f t}\,df \qquad (17)$$

in terms of the time-varying frequency response $H(t,f)$. Standard Fourier transform identities may be used to fill out Figure 4 and write input-output equations using the input delay-Doppler spread function $c(\nu,\tau)$ or the output Doppler-spread function $B(\nu,f)$.


Figure 4: Basic input-output characterizations for LTV system.


7.  WIDE-SENSE  STATIONARY  AND 
UNCORRELATED  SCATTERING  CASE 

When the WSS and US assumptions are combined (yielding the WSSUS assumption), all of the two-dimensional Fourier transform relationships are reduced to one-dimensional Fourier transform relationships [7], as illustrated in Figure 5. It is important to note that the quantity $P_\sigma(\tau,\nu)$ in Figure 5 is commonly called the scattering function. It is also important to note that only the West corner of Figure 5,

$$R_H(\Delta f,\Delta t) = E\big(H(t+\Delta t,\,f)\,H^*(t,\,f-\Delta f)\big),$$

is a correlation function. The remaining corners are power densities, from which singular correlations may be constructed by applying $\delta$-functions. In Figure 5, the global variables $(\tau,\nu)$ play the same role as $(t,f)$ previously, and the local variables $(\Delta f,\Delta t)$ play the same role as $(\nu,\tau)$ previously.


Figure  5:  Four-corners  diagram  for  WSSUS  case. 


8.  CONNECTIONS  BETWEEN  TF  DISTRIBUTIONS 

AND  SCATTERING  FUNCTION  ESTIMATION 

We consider a WSSUS channel with deterministic input signal $x(t)$, input delay-spread function $h(t,\tau)$ and output $y(t)$. The mean of the ambiguity function for $y$ is

$$E\big(\Gamma_{yy}(\Delta f,\Delta t)\big) = R_H(\Delta f,\Delta t)\,\Gamma_{xx}(\Delta f,\Delta t), \qquad (18)$$

which is illustrated in Figure 6. Equation (18) is the fundamental result connecting input ambiguity to output ambiguity, through the time-frequency correlation $R_H$. Application of Fourier transform properties produces the remaining corners of Figure 6.

Comparing Figure 6 with Figure 3, we find that estimation of the scattering function (SF) $P_\sigma$ is the same as estimation of the TF distribution except for the change in design rules: for TF we design $e(t,f)$ to estimate TF properties of signal $x(\cdot)$ ($f = g = x$ in Figure 3); for SF we design the transmitted signal $x$ to estimate $P_\sigma$. For TF we are trying to smooth whereas for SF we are trying to differentiate; or convolve for TF vs. deconvolve for SF.

Figure 6: Rihaczek-based four-corners diagram for SF estimation.

Gaarder [2] proposed a two-stage translation-variant estimate for the scattering function using symmetric ambiguity functions as well as symmetric correlation functions. Here, we highlight a translation-invariant version of Gaarder's estimate with the notational conventions of the current paper. Stage 1 computes in the East

$$|\Gamma_{y h_1}(\tau,\nu)|^2,$$

which corresponds in the West (with the aid of the Sussman identity) to

$$w(\Delta f,\Delta t) = (\Gamma_{yy}\,\Gamma^*_{h_1 h_1})(\Delta f,\Delta t).$$

Stage 2 adds a multiplication in the West by $h_2(\Delta f,\Delta t)$ to obtain

$$\hat{p}_\sigma(\Delta f,\Delta t) = (h_2\cdot\Gamma^*_{h_1 h_1}\cdot\Gamma_{yy})(\Delta f,\Delta t),$$

corresponding in the East to the scattering function estimate

$$\hat{P}_\sigma(\tau,\nu) = \big(H_2 ** |\Gamma_{y h_1}(\tau,\nu)|^2\big)(\tau,\nu);$$

or

$$\hat{P}_\sigma(\tau,\nu) = (H_{eq} ** \Upsilon_{yy})(\tau,\nu),$$

where

$$h_{eq}(\Delta f,\Delta t) = (h_2\cdot\Gamma^*_{h_1 h_1})(\Delta f,\Delta t).$$

We easily find that

$$E\big(\hat{p}_\sigma(\Delta f,\Delta t)\big) = (h_{eq}\cdot\Gamma_{xx}\cdot R_H)(\Delta f,\Delta t),$$

or (letting $r(\Delta f,\Delta t) = (h_{eq}\cdot\Gamma_{xx})(\Delta f,\Delta t)$)

$$E\big(\hat{P}_\sigma(\tau,\nu)\big) = (R ** P_\sigma)(\tau,\nu).$$

9. CONCLUSION

We have presented a unified treatment of TFD estimation and SF estimation using a Rihaczek foundation. Four-corners diagrams are used to summarize key relationships. A fundamental key to the performance analysis of TFD and SF estimates is the Sussman four-corners diagram.

10. ACKNOWLEDGEMENT

This work was supported by the DARPA ISP program under contracts AFRL F33615-02-C-1198 and FA9550-04-1-0371.
The Sussman, Moyal, and Janssen Formulas are
Fourier  Transform  Consequences  of  a  More 
Fundamental  Identity 

David  C.  Farden,  Member,  IEEE, 
and  Louis  L.  Scharf,  Fellow,  IEEE 

Abstract — Janssen’s  formula  is  a  sampled-data  version  of  Moyal’s,  and 
both  follow  from  Sussman’s  identity,  which  itself  is  a  consequence  of  a 
more  fundamental  convolution  identity. 

Index  Terms — Sussman,  Moyal,  Janssen  identities. 

I. Connections

Let $\psi_{fg}(t,\tau) = f(t+\tau)\,g^*(t)$ and define the adjoint $\tilde{\psi}_{fg}(t,\tau) = \psi^*_{fg}(-t,\tau)$. The fundamental convolution identity we shall exploit is this [1]:

$$(\psi_{fg} *_1 \tilde{\psi}_{yx})(t,\tau) = (\psi_{fy} *_1 \tilde{\psi}_{gx})(\tau,t), \qquad (1)$$

where $*_1$ denotes convolution with respect to the first variable. The proof of (1) is a simple exercise in convolution. Note the swapping of $g$ with $y$, and $t$ with $\tau$.

Define the ambiguity function

$$\Gamma_{fg}(\nu,\tau) = \int_{-\infty}^{\infty} f(t+\tau)\,g^*(t)\,e^{-j2\pi\nu t}\,dt,$$

and note that $\mathcal{F}_{\nu,t}\,\tilde{\psi}_{fg}(t,\tau) = \Gamma^*_{fg}(\nu,\tau)$. Now Fourier transform the LHS of (1) from $t$ to $\nu$, and the RHS from $\tau$ to $f$. Noting that each of these Fourier transforms is with respect to the first variable, the Fourier transform identities of Fig. 1 are readily obtained. In Fig. 1
we  have  defined  'J//g(u, /)  =  T f  i-F/g(u,  r),  and  \E'/g(^, /)  = 


Fig. 1. Sussman four-corners diagram.

The two-dimensional Fourier transform pair

$$(\Gamma_{fg}\,\Gamma^*_{yx})(\nu,\tau) \longleftrightarrow (\Gamma_{fy}\,\Gamma^*_{gx})(f,t), \qquad (2)$$

known as the Sussman identity [2], follows from the fundamental convolution equality (1), together with the two one-dimensional Fourier transforms that separate the West and the East corners of Fig. 1 by a two-dimensional Fourier transform.

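Now integrate the LHS of the Sussman identity (2) over $(\nu,\tau)$, and denote it by the inner product notation $\langle\Gamma_{fg},\Gamma_{yx}\rangle$, to find

$$\langle\Gamma_{fg},\Gamma_{yx}\rangle = (\Gamma_{fy}\,\Gamma^*_{gx})(0,0),$$

which is just an initial value theorem of Fourier analysis. Then note that $\Gamma_{fy}(0,0) = \langle f,y\rangle$, meaning

$$\langle\Gamma_{fg},\Gamma_{yx}\rangle = \langle f,y\rangle\,\langle g,x\rangle^*. \qquad (3)$$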

D.  Farden  is  with  North  Dakota  State  University,  and  L.  Scharf  is  with 
Colorado  State  University. 


This is Moyal's identity [3]. Thus Moyal's identity is a consequence of Sussman's identity, which in turn is a Fourier transform version of the more fundamental identity (1).
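As a numerical sanity check of (3), the following is a minimal sketch that assumes a cyclic, discrete-time analogue of the ambiguity function (length-$N$ sequences, circular shifts, unnormalized DFT). The function names are illustrative and not from the paper. Under these assumptions the discrete form of Moyal's identity picks up a factor $N$ from the unnormalized DFT.

```python
import numpy as np

def ambiguity(f, g):
    """Cyclic discrete ambiguity function (an assumed discretization).

    Returns G with G[m, k] = sum_n f[(n + m) % N] * conj(g[n]) * exp(-2j*pi*k*n/N),
    where m indexes the lag and k the frequency bin.
    """
    f = np.asarray(f, dtype=complex)
    g = np.asarray(g, dtype=complex)
    n_samples = len(f)
    # Row m holds f[n + m] * conj(g[n]) as a function of n (circular shift).
    psi = np.array([np.roll(f, -m) * np.conj(g) for m in range(n_samples)])
    # DFT along n (axis 1) maps time index n to frequency bin k.
    return np.fft.fft(psi, axis=1)

def moyal_sides(f, g, y, x):
    """Return both sides of the discrete Moyal identity."""
    lhs = np.sum(ambiguity(f, g) * np.conj(ambiguity(y, x)))
    inner_fy = np.vdot(y, f)        # <f, y> = sum_n f[n] conj(y[n])
    inner_gx_conj = np.vdot(g, x)   # <g, x>^* = sum_n conj(g[n]) x[n]
    rhs = len(f) * inner_fy * inner_gx_conj
    return lhs, rhs

rng = np.random.default_rng(0)
f, g, y, x = (rng.standard_normal(16) + 1j * rng.standard_normal(16) for _ in range(4))
lhs, rhs = moyal_sides(f, g, y, x)
print(abs(lhs - rhs))
```

The printed difference should be at the level of floating-point round-off for any choice of the four sequences.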

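The sampled-data version of Sussman's identity is, by Poisson's sum formula,

$$(\Gamma_{fg}\,\Gamma^*_{yx})(mF,\,nT) \longleftrightarrow \frac{1}{FT}\sum_{k,\ell}(\Gamma_{fy}\,\Gamma^*_{gx})\Big(f+\frac{k}{T},\,t+\frac{\ell}{F}\Big). \qquad (4)$$

Invoke an initial-value theorem to write the sampled-data version of Moyal's formula as

$$\sum_{m,n}(\Gamma_{fg}\,\Gamma^*_{yx})(mF,\,nT) = \frac{1}{FT}\sum_{k,\ell}(\Gamma_{fy}\,\Gamma^*_{gx})\Big(\frac{k}{T},\,\frac{\ell}{F}\Big),$$

or, in terms of discrete inner products, as

$$\langle\Gamma_{fg},\Gamma_{yx}\rangle = \frac{1}{FT}\,\langle\Gamma_{fy},\Gamma_{gx}\rangle. \qquad (5)$$

This is actually a generalized version of Janssen's formula [4]. That is, when $g_{\nu,\tau}(t) = g(t-\tau)\,e^{j2\pi\nu t}$, then $\Gamma_{fg}(\nu,\tau) = \langle f, g_{\nu,\tau}\rangle\,e^{j2\pi\nu\tau}$, and we may write the sampled-data version of Moyal's formula (5) as

$$\sum_{m,n}\langle f, g_{mF,nT}\rangle\,\langle y, x_{mF,nT}\rangle^* = \frac{1}{FT}\sum_{k,\ell}\langle f, y_{\frac{k}{T},\frac{\ell}{F}}\rangle\,\langle g, x_{\frac{k}{T},\frac{\ell}{F}}\rangle^*, \qquad (6)$$

which is the usual form of Janssen's identity [4].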

II.  Conclusion 

Thus, the equivalent fundamental identities are (1) and (2), the latter called Sussman's identity, with Moyal's formula (3) following from an initial value theorem of Fourier analysis, and Janssen's equality (6) following from Poisson's sum formula and an initial value theorem. That is, Janssen's formula is a sampled-data version of Moyal's, and both follow from Sussman's identity, which itself is a consequence of the fundamental identity (1).
Introduction

The purpose of the ISP Program (Integrated Sensing and Processing) is to develop systems for optimal adaptive cooperation among distributed sensors and a global processing unit (GPU). This cooperation will adapt the amount of computation/processing done at local sensors, the communication bandwidth to the GPU, and the amount of computation/processing done at the GPU.

As discussed at the June 2003 ISP Kickoff Meeting, one important research component is the estimation of channel statistics. A prerequisite of this is a channel model. Hence, Section 1 of this report is devoted to the discussion of a multipath-Doppler channel model as would be encountered in a wireless environment. This is the time-varying communication environment over which the distributed sensors and GPU must communicate.

The GPU will be receiving signals from many sensors at the same time, and it will be necessary to isolate the signal from the desired sensor. Four different models for the channel seen by the desired user and the channels seen by interfering users are discussed in Section 2. Our contribution here is to show that, even when the interference is infinite dimensional, the analysis of case 2 can be reduced to case 1, and the analysis of case 3 can be reduced to case 4.

Section  3  is  concerned  with  the  blind  estimation  of  the  interference-free  signal  subspace, 
which  naturally  arises  in  the  zero-forcing  detector  of  subspace  signals  in  subspace  interference 
and  noise.  This  section  extends  known  finite-dimensional  results  to  the  infinite-dimensional 
case. 

In  the  preceding  sections,  no  assumption  was  made  about  the  signaling  waveforms  of  the 
desired  user.  Section  4  shows  how  to  design  these  waveforms  to  maximize  the  average  mutual 
information  between  the  channel  coefficients  and  the  received  waveform.  It  is  shown  that 
the  optimum  waveforms  depend  only  on  the  covariance  matrix  of  the  channel  coefficients. 

Section 5 looks at GPU design when each sensor can transmit only one bit of information about its measurement. This section also considers an approach to GPU design when each sensor is allowed to send a multi-bit word of information about its measurement. Further research along these lines should allow for an adaptive system in which the sensing/computation of the local sensors, the information they transmit to the GPU, and the processing at the GPU are adjusted based on time-varying channel capacity.


October  5,  2003 
1. Multipath-Doppler Channel Models

To exploit the matched subspace detectors of [8], [9], we first give a concise derivation of the multipath-Doppler model of [7] in Section 1.1. Then in Section 1.2 we discuss orthogonal projections onto the multipath-Doppler subspace. Interestingly, the quantities appearing in the normal equations can be expressed in terms of the ambiguity function of the transmitted waveform and the cross-ambiguity function of the received waveform and the transmitted waveform. In Section 1.3, an alternative derivation yields a different model parameterization.

1.1.  First  Derivation 

In this section we first derive a sampling theorem for the output of a linear, time-varying system when:

1. The input is bandlimited to $W$.

2. The output is of interest only during the finite time interval $[0,T]$.

The second step is to show that if the channel is causal and has finite multipath spread $\tau_m$ and finite Doppler spread $B_d$, then the infinite series of our sampling theorem can be truncated with negligible error.

Consider a time-varying, linear channel model of the form

$$y(t) = \int h(t,\tau)\,x(t-\tau)\,d\tau. \qquad (1.1)$$

Writing this convolution as the inverse transform of the transforms, we have

$$y(t) = \int H(t,f)\,X(f)\,e^{j2\pi f t}\,df,$$

where $H$ is the time-varying transfer function

$$H(t,f) := \int h(t,\tau)\,e^{-j2\pi f\tau}\,d\tau,$$

and

$$X(f) := \int x(\tau)\,e^{-j2\pi f\tau}\,d\tau.$$

If we now use the fact that the signal $x$ is bandlimited to $W$ and that we are interested in $y(t)$ only for $0 \le t \le T$, then

$$y(t) = \int_{-W}^{W} H(t,f)\,X(f)\,e^{j2\pi f t}\,df, \qquad 0 \le t \le T. \qquad (1.2)$$

Since the foregoing expression involves $H(t,f)$ for $(t,f) \in [0,T]\times[-W,W]$, we can expand $H$ in the bivariate Fourier series

$$H(t,f) = \sum_k\sum_\ell H_{k,\ell}\,e^{j2\pi kt/T}\,e^{-j2\pi\ell f/2W}.$$

Substituting this expansion into (1.2) yields, for $0 \le t \le T$,

$$y(t) = \sum_k\sum_\ell H_{k,\ell}\,e^{j2\pi kt/T}\int_{-W}^{W} X(f)\,e^{j2\pi f(t-\ell/2W)}\,df$$

$$\phantom{y(t)} = \sum_k\sum_\ell H_{k,\ell}\,e^{j2\pi kt/T}\,x(t-\ell/2W). \qquad (1.3)$$
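To complete the sampling theorem, we now analyze the formulas for the coefficients $H_{k,\ell}$. Write

$$H_{k,\ell} = \frac{1}{2WT}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(t,f)\,I_{[0,T]}(t)\,I_{[-W,W]}(f)\,e^{-j2\pi kt/T}\,e^{j2\pi\ell f/2W}\,df\,dt.$$

Here $I_{[0,T]}(t)$ is the indicator function of the set $[0,T]$; i.e., $I_{[0,T]}(t) = 1$ for $t \in [0,T]$ and is zero otherwise. We regard each integral as the transform or inverse transform of a product. Thus,

$$H_{k,\ell} = \frac{1}{T}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} I_{[0,T]}(t)\,e^{-j2\pi kt/T}\,h(t,\tau)\,\mathrm{sinc}\Big(2W\Big[\frac{\ell}{2W}-\tau\Big]\Big)\,d\tau\,dt \qquad (1.4)$$

$$\phantom{H_{k,\ell}} = \int_{-\infty}^{\infty}\mathrm{sinc}\Big(2W\Big[\frac{\ell}{2W}-\tau\Big]\Big)\int_{-\infty}^{\infty} C(\nu,\tau)\,\mathrm{sinc}\Big(T\Big[\frac{k}{T}-\nu\Big]\Big)\,e^{-j\pi T(k/T-\nu)}\,d\nu\,d\tau, \qquad (1.5)$$

where

$$C(\nu,\tau) := \int h(t,\tau)\,e^{-j2\pi\nu t}\,dt \qquad (1.6)$$

is called the scattering function. Notice that if $W$ and $T$ are both large, then the sinc functions act like impulses, and

$$H_{k,\ell} \approx \frac{C(k/T,\,\ell/2W)}{2WT}.$$

Hence, we call the $H_{k,\ell}$ the scattering coefficients.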

We now impose further assumptions. If the system in (1.1) is causal, then $h(t,\tau) = 0$ for $\tau < 0$, and so

$$y(t) = \int_0^{\infty} h(t,\tau)\,x(t-\tau)\,d\tau. \qquad (1.7)$$

If the system's response to an impulse at time $t_0$ is finite duration, e.g., zero for $t > t_0 + \tau_m$, then

$$h(t,\tau) = 0, \quad \text{for } \tau > \tau_m,$$

and (1.7) becomes

$$y(t) = \int_0^{\tau_m} h(t,\tau)\,x(t-\tau)\,d\tau.$$

Furthermore, the inner integral in (1.4) becomes

$$\int_0^{\tau_m} h(t,\tau)\,\mathrm{sinc}\Big(2W\Big[\frac{\ell}{2W}-\tau\Big]\Big)\,d\tau,$$

which is approximately zero¹ for $\ell < -1$ and for $\ell > 2W\tau_m + 1$. Taking $L := \lceil 2W\tau_m\rceil$, we have $H_{k,\ell} \approx 0$ for $\ell < 0$ and $\ell > L$. We next assume that the channel has finite Doppler spread $B_d$; i.e.,

$$C(\nu,\tau) = 0, \quad \text{for } |\nu| > B_d.$$

¹ The convolution is significant when the main lobe of the sinc is touching $[0,\tau_m]$.
Then the inner integral in (1.5) becomes

$$\int_{-B_d}^{B_d} C(\nu,\tau)\,\mathrm{sinc}\Big(T\Big[\frac{k}{T}-\nu\Big]\Big)\,e^{-j\pi T(k/T-\nu)}\,d\nu.$$

This convolution is approximately zero for $|k| > B_dT + 1$. We therefore put $M := \lceil B_dT\rceil$ so that $H_{k,\ell} \approx 0$ for $|k| > M$. We now have the approximation of (1.3),

$$y(t) \approx \sum_{k=-M}^{M}\sum_{\ell=0}^{L} H_{k,\ell}\,e^{j2\pi kt/T}\,x(t-\ell/2W), \qquad 0 \le t \le T.$$

The importance of this representation is that the only unknowns are the scattering coefficients $H_{k,\ell}$. The values of $M$, $L$, $T$, and $W$ are known. This is to be compared with the usual representation

$$y(t) \approx \sum_k\sum_\ell H_{k,\ell}\,e^{j2\pi\nu_k t}\,x(t-\tau_\ell), \qquad 0 \le t \le T,$$

where the Doppler frequencies $\nu_k$ and delays $\tau_\ell$ are also unknown. Note, however, that the number of terms $k$ and $\ell$ in this representation will typically be less than $2M+1$ and $L+1$, respectively. In other words, the linear model reduces complexity by increasing the number of coefficients.
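To make the size of the truncated model concrete, the following minimal sketch computes $L = \lceil 2W\tau_m\rceil$, $M = \lceil B_dT\rceil$, and the resulting number $(2M+1)(L+1)$ of scattering coefficients. The function name and the numerical values are illustrative assumptions, not taken from the report.

```python
import math

def canonical_model_size(bandwidth_w, duration_t, delay_spread, doppler_spread):
    """Return (L, M, number of coefficients) for the truncated model.

    L = ceil(2 W tau_m) delay taps, M = ceil(B_d T) Doppler taps on each side,
    giving (2M + 1)(L + 1) unknown scattering coefficients H_{k,l}.
    """
    num_delays = math.ceil(2.0 * bandwidth_w * delay_spread)
    num_dopplers = math.ceil(doppler_spread * duration_t)
    num_coeffs = (2 * num_dopplers + 1) * (num_delays + 1)
    return num_delays, num_dopplers, num_coeffs

# Hypothetical numbers: W = 4 Hz, T = 2 s, tau_m = 0.25 s, B_d = 1.5 Hz.
print(canonical_model_size(4.0, 2.0, 0.25, 1.5))
```

The count grows with the products $W\tau_m$ and $B_dT$, which is the price paid for having only linear unknowns.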


1.2.  Projections  onto  the  Signal  Subspace 

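Based on the foregoing analysis, when the bandlimited waveform $x$ is sent over the multipath-Doppler channel, we model the received waveform by

$$y(t) = \sum_{k=-M}^{M}\sum_{\ell=0}^{L} H_{k,\ell}\,e^{j2\pi kt/T}\,x(t-\ell/2W) + n(t), \qquad 0 \le t \le T,$$

where $n(t)$ is noise. Consider the problem of projecting $y$ onto the subspace

$$S := \mathrm{span}\{s_{k,\ell},\ -M \le k \le M,\ 0 \le \ell \le L\},$$

where

$$s_{k,\ell}(t) := e^{j2\pi kt/T}\,x(t-\ell/2W), \qquad 0 \le t \le T.$$

The usual orthogonality-principle argument [5, §3.6] says that the projection $\hat{y}$ is given by

$$\hat{y}(t) = \sum_{k=-M}^{M}\sum_{\ell=0}^{L} \hat{H}_{k,\ell}\,s_{k,\ell}(t),$$

where the coefficients $\hat{H}_{k,\ell}$ solve the normal equations. The entries of the Gram matrix are given by $\langle s_{k,\ell}, s_{k',\ell'}\rangle$, and the components of the "other side" of the normal equations are given by

$$\langle y, s_{k',\ell'}\rangle = \int y(t)\,s_{k',\ell'}(t)^*\,dt$$

$$\phantom{\langle y, s_{k',\ell'}\rangle} = \int y(t)\,e^{-j2\pi k't/T}\,x(t-\ell'/2W)^*\,dt$$

$$\phantom{\langle y, s_{k',\ell'}\rangle} = \int y(\theta+\ell'/2W)\,x(\theta)^*\,e^{-j2\pi k'(\theta+\ell'/2W)/T}\,d\theta$$

$$\phantom{\langle y, s_{k',\ell'}\rangle} = e^{-j2\pi k'\ell'/2WT}\,A_{yx}(k'/T,\,\ell'/2W),$$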
e-j2,k’l’l2WT  Ayx(k> /TJ> /2W), 
where the $*$ indicates complex conjugation, and $A_{yx}$ is the cross-ambiguity function,

$$A_{yx}(\nu,\tau) := \int y(t+\tau)\,x(t)^*\,e^{-j2\pi\nu t}\,dt.$$

In the foregoing, it is understood that $y(t) = 0$ for $t$ outside $[0,T]$. The entries of the Gram matrix are

$$\langle s_{k,\ell}, s_{k',\ell'}\rangle = \int e^{j2\pi kt/T}\,x(t-\ell/2W)\,e^{-j2\pi k't/T}\,x(t-\ell'/2W)^*\,dt,$$

where the range of integration is $[0,T]$. If the temporal support of all the $x(t-\ell/2W)$ is essentially contained in $[0,T]$, we can write²

$$\langle s_{k,\ell}, s_{k',\ell'}\rangle = \int x(\theta+[\ell'-\ell]/2W)\,x(\theta)^*\,e^{j2\pi[k-k'](\theta+\ell'/2W)/T}\,d\theta$$

$$\phantom{\langle s_{k,\ell}, s_{k',\ell'}\rangle} = e^{-j2\pi[k'-k]\ell'/2WT}\int x(\theta+[\ell'-\ell]/2W)\,x(\theta)^*\,e^{-j2\pi[k'-k]\theta/T}\,d\theta \qquad (1.8)$$

$$\phantom{\langle s_{k,\ell}, s_{k',\ell'}\rangle} = e^{-j2\pi[k'-k]\ell'/2WT}\,A_{xx}([k'-k]/T,\,[\ell'-\ell]/2W).$$

If the assumption about the temporal support of the $x(t-\ell/2W)$ is not valid, then the integral in (1.8) is replaced by

$$\int_{-\ell'/2W}^{T-\ell'/2W} x(\theta+[\ell'-\ell]/2W)\,x(\theta)^*\,e^{-j2\pi[k'-k]\theta/T}\,d\theta.$$
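The projection onto $S$ can be sketched numerically by sampling the basis waveforms $s_{k,\ell}$ on a time grid and solving the least-squares problem, which is equivalent to solving the normal equations with sampled inner products. This is a minimal sketch under stated assumptions: the test waveform (random tones at multiples of $1/T$ inside $[-W, W]$), the grid, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def make_waveform(bandwidth_w, duration_t, seed=0):
    """Hypothetical bandlimited test waveform: random tones inside [-W, W]."""
    rng = np.random.default_rng(seed)
    max_index = int(round(bandwidth_w * duration_t))
    freqs = np.arange(-max_index, max_index + 1) / duration_t
    amps = rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)

    def x_func(t):
        t = np.asarray(t, dtype=float)
        return np.exp(2j * np.pi * np.outer(t, freqs)) @ amps

    return x_func

def build_basis(x_func, t, num_dopplers, num_delays, bandwidth_w, duration_t):
    """Columns are samples of s_{k,l}(t) = exp(j 2 pi k t / T) x(t - l / 2W)."""
    columns = [
        np.exp(2j * np.pi * k * t / duration_t) * x_func(t - ell / (2.0 * bandwidth_w))
        for k in range(-num_dopplers, num_dopplers + 1)
        for ell in range(num_delays + 1)
    ]
    return np.column_stack(columns)

def project_onto_subspace(y, basis):
    """Least-squares coefficients and the projection of y onto span(basis)."""
    coeffs, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return coeffs, basis @ coeffs

W, T = 16.0, 2.0
t = np.linspace(0.0, T, 400, endpoint=False)
S = build_basis(make_waveform(W, T), t, num_dopplers=3, num_delays=2,
                bandwidth_w=W, duration_t=T)
rng = np.random.default_rng(1)
y = rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size)
H_hat, y_hat = project_onto_subspace(y, S)
# Orthogonality principle: the residual is orthogonal to every basis waveform.
print(np.linalg.norm(S.conj().T @ (y - y_hat)))
```

The Gram matrix of the sampled basis is `S.conj().T @ S` (up to the grid spacing), so the ambiguity-function expressions above could be used to fill it in without forming `S` explicitly.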

1.3.  Derivation  of  the  Second  Model 

The derivation in Section 1.1 was based on the bivariate Fourier expansion of $H(t,f)$ followed by the finite multipath delay and finite Doppler spread assumptions. Alternatively, we can rewrite (1.1) in terms of $C(\nu,\tau)$ defined in (1.6). Thus,

$$y(t) = \int\Big[\int C(\nu,\tau)\,e^{j2\pi\nu t}\,d\nu\Big]\,x(t-\tau)\,d\tau.$$

If we now assume that the system is causal with finite multipath spread $\tau_m$ and finite Doppler spread $B_d$, we have

$$y(t) = \int_0^{\tau_m}\Big[\int_{-B_d}^{B_d} C(\nu,\tau)\,e^{j2\pi\nu t}\,d\nu\Big]\,x(t-\tau)\,d\tau. \qquad (1.9)$$

This time we expand $C(\nu,\tau)$ in a bivariate Fourier series on $[-B_d,B_d]\times[0,\tau_m]$. Substituting

$$C(\nu,\tau) = \sum_\ell\sum_k C_{\ell,k}\,e^{-j2\pi\ell\nu/2B_d}\,e^{j2\pi k\tau/\tau_m}$$

into (1.9) yields

$$y(t) = \sum_\ell\sum_k C_{\ell,k}\int_0^{\tau_m}\Big[\int_{-B_d}^{B_d} e^{j2\pi\nu(t-\ell/2B_d)}\,d\nu\Big]\,x(t-\tau)\,e^{j2\pi k\tau/\tau_m}\,d\tau$$

$$\phantom{y(t)} = 2B_d\sum_\ell\sum_k C_{\ell,k}\,\mathrm{sinc}\Big(2B_d\Big[t-\frac{\ell}{2B_d}\Big]\Big)\int_0^{\tau_m} x(t-\tau)\,e^{j2\pi k\tau/\tau_m}\,d\tau.$$

² Since the nonzero waveform $x$ is bandlimited, it cannot be time limited.
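In this last integral, substitute

$$x(t-\tau) = \int X(f)\,e^{j2\pi f(t-\tau)}\,df$$

to get

$$y(t) = 2B_d\tau_m\sum_\ell\sum_k C_{\ell,k}\,\mathrm{sinc}\Big(2B_d\Big[t-\frac{\ell}{2B_d}\Big]\Big)\cdot\int X(f)\,e^{-j\pi\tau_m(f-k/\tau_m)}\,\mathrm{sinc}(\tau_m[f-k/\tau_m])\,e^{j2\pi ft}\,df$$

$$\phantom{y(t)} = 2B_d\tau_m\sum_\ell\sum_k C_{\ell,k}\,(-1)^k\,\mathrm{sinc}\Big(2B_d\Big[t-\frac{\ell}{2B_d}\Big]\Big)\cdot\int X(f)\,\mathrm{sinc}(\tau_m[f-k/\tau_m])\,e^{j2\pi f(t-\tau_m/2)}\,df.$$

For $\tau_m$ large, the last sinc function acts like a delta function and so

$$y(t) \approx 2B_d\sum_\ell\sum_k C_{\ell,k}\,(-1)^k\,\mathrm{sinc}\Big(2B_d\Big[t-\frac{\ell}{2B_d}\Big]\Big)\,X(k/\tau_m)\,e^{j2\pi(k/\tau_m)(t-\tau_m/2)}$$

$$\phantom{y(t)} = 2B_d\sum_\ell\sum_k C_{\ell,k}\,\mathrm{sinc}\Big(2B_d\Big[t-\frac{\ell}{2B_d}\Big]\Big)\,X(k/\tau_m)\,e^{j2\pi kt/\tau_m}. \qquad (1.10)$$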
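The step that produces the sinc and the factor $(-1)^k$ rests on the finite-window integral $\int_0^{\tau_m} e^{-j2\pi(f-k/\tau_m)\tau}\,d\tau = \tau_m\,\mathrm{sinc}(\tau_m[f-k/\tau_m])\,e^{-j\pi\tau_m(f-k/\tau_m)}$, with $\mathrm{sinc}(u) = \sin(\pi u)/(\pi u)$ assumed as the convention (this matches `numpy.sinc`). The following minimal sketch, with hypothetical function names and test values, compares the closed form against a midpoint-rule quadrature.

```python
import numpy as np

def window_integral_closed_form(f, k, tau_m):
    """tau_m * sinc(tau_m * a) * exp(-j pi tau_m a), with a = f - k / tau_m."""
    a = f - k / tau_m
    return tau_m * np.sinc(tau_m * a) * np.exp(-1j * np.pi * tau_m * a)

def window_integral_numeric(f, k, tau_m, num_points=20000):
    """Midpoint-rule approximation of the integral over [0, tau_m]."""
    d_tau = tau_m / num_points
    tau = (np.arange(num_points) + 0.5) * d_tau
    return np.sum(np.exp(-2j * np.pi * (f - k / tau_m) * tau)) * d_tau

f0, k0, tau0 = 1.3, 2, 0.7
closed = window_integral_closed_form(f0, k0, tau0)
numeric = window_integral_numeric(f0, k0, tau0)
# Same quantity written with the (-1)^k factor pulled out, as in the text.
factored = ((-1) ** k0) * tau0 * np.sinc(tau0 * (f0 - k0 / tau0)) * np.exp(-1j * np.pi * tau0 * f0)
print(abs(closed - numeric), abs(closed - factored))
```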


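We now turn to the coefficients $C_{\ell,k}$. Write

$$C_{\ell,k} = \frac{1}{2B_d\tau_m}\int_0^{\tau_m}\int_{-B_d}^{B_d} C(\nu,\tau)\,e^{j2\pi\ell\nu/2B_d}\,e^{-j2\pi k\tau/\tau_m}\,d\nu\,d\tau$$

$$\phantom{C_{\ell,k}} = \frac{1}{\tau_m}\int_0^{\tau_m} e^{-j2\pi k\tau/\tau_m}\Big[\int h(t,\tau)\,\mathrm{sinc}\Big(2B_d\Big[\frac{\ell}{2B_d}-t\Big]\Big)\,dt\Big]\,d\tau$$

$$\phantom{C_{\ell,k}} = \int \mathrm{sinc}\Big(2B_d\Big[\frac{\ell}{2B_d}-t\Big]\Big)\Big[\frac{1}{\tau_m}\int_0^{\tau_m} h(t,\tau)\,e^{-j2\pi k\tau/\tau_m}\,d\tau\Big]\,dt$$

$$\phantom{C_{\ell,k}} = \int \mathrm{sinc}\Big(2B_d\Big[\frac{\ell}{2B_d}-t\Big]\Big)\Big[\int H(t,f)\,\mathrm{sinc}\Big(\tau_m\Big[\frac{k}{\tau_m}-f\Big]\Big)\,e^{-j\pi\tau_m(k/\tau_m-f)}\,df\Big]\,dt. \qquad (1.11)$$

Notice that for large $\tau_m$ and $B_d$, the sinc functions act like impulses, and so

$$C_{\ell,k} \approx \frac{H(\ell/2B_d,\,k/\tau_m)}{2B_d\tau_m}.$$

For this reason, we call the $C_{\ell,k}$ the channel coefficients. If we assume that the time-varying transfer function is bandlimited to $W$, then (1.11) tells us that $C_{\ell,k} \approx 0$ for $|k| > \lceil W\tau_m\rceil$. Similarly, if we restrict $t$ to $[0,T]$ in (1.10), we see that for $\ell < 0$ and $\ell > \lceil 2B_dT\rceil$, $C_{\ell,k} \approx 0$. Thus, we can approximate $y(t)$ by using finite sums in (1.10). So far, we do not see how to exploit this representation.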
2. Subspace Signals in Subspace Interference and Noise

Consider a receiver whose input is

$$y = a + b + n,$$

where $a$ is the desired signal, $b$ is an interference signal, and $n$ is white noise. It is assumed that $a$ belongs to a finite-dimensional subspace $\mathcal{A}$ and $b$ belongs to a subspace $\mathcal{B}$, which may be infinite dimensional. There are four (Gaussian) cases to consider [10]:

1.  a  and  b  are  deterministic  but  unknown. 

2.  a  is  deterministic  and  unknown,  but  b  is  random  and  independent  of  n. 

3.  b  is  deterministic  and  unknown,  but  a  is  random  and  independent  of  n. 

4. $a$ and $b$ are random with $a$, $b$, and $n$ independent.

According to [10, p. 2939], all except case 3 have been discussed at length in the literature when $a$, $b$, and $n$ are finite-dimensional vectors in $\mathbb{C}^m$. Case 3 (for finite-dimensional vectors) is treated in [10, Section IV-C]. Here we first generalize by allowing the interference to lie in an infinite-dimensional subspace. We then show that case 2 can be transformed into case 1 with no interference term, and we show that case 3 can be transformed into case 4 with no interference term. Hence, once we know how to solve cases 1 and 4, those solutions can be used to solve cases 2 and 3, respectively. The importance of these transformations is that case 4 is well understood, and case 1 has recently been solved for constraints on $a$ and with $\mathcal{B}$ being infinite dimensional [3].

2.1.  Model  Details 

We take $a$ to be of the form

$$a = Au := \sum_{k=1}^{p} u_k a_k,$$

where $a_1,\ldots,a_p$ are linearly independent. Thus, $a$ lives in the finite-dimensional subspace

$$\mathcal{A} := \mathrm{span}\{a_1,\ldots,a_p\},$$

and $u := [u_1,\ldots,u_p]'$ is either deterministic and unknown or random with zero mean and covariance matrix $R_u$. When $b$ is deterministic, it is allowed to lie, for example, in a closed, infinite-dimensional subspace $\mathcal{B}$ of $L^2[0,T]$. When $b$ is a zero-mean random process with covariance function $r_b(t_1,t_2)$, the process is assumed to have a Karhunen-Loeve expansion, e.g.,

$$b(t) = \sum_{k=1}^{\infty} B_k s_k(t),$$

where

$$B_k = \int_0^T b(t)\,s_k(t)^*\,dt, \qquad (2.1)$$

the $s_k$ are orthonormal, the $B_k$ are independent, and $\int_0^T r_b(t,\tau)\,s_k(\tau)\,d\tau = \beta_k s_k(t)$. Here we take

$$\mathcal{B} = \overline{\mathrm{span}}\{s_1, s_2, \ldots\}.$$
We have recently solved case 1 for infinite-dimensional $\mathcal{B}$ in [3]. In case 4, linear estimation of $a$ (or $u$) is a standard Wiener-filter problem. If Karhunen-Loeve expansions are valid, then detection is also straightforward. We next show that case 2 can be reduced to a special version of case 1 in which $\mathcal{B}$ is the zero subspace. Similarly, we show that case 3 can be reduced to a special version of case 4 in which there is no interference.

2.2.  Reduction  of  Case  2  to  Case  1 

The first step is to write $y = a + b + n$ as $y = a + w$, where $w := b + n$. The covariance function of $w$ is

$$r_w(t_1,t_2) = r_b(t_1,t_2) + \sigma^2\delta(t_1-t_2),$$

where $\delta$ is the Dirac delta function. The corresponding covariance operator is $R_w := R_b + \sigma^2 I$, where $I$ is the identity operator, and

$$(R_b x)(t) := \int_0^T r_b(t,\tau)\,x(\tau)\,d\tau.$$

The basic idea is that observing $y$ is equivalent to observing

$$\tilde{y} := R_w^{-1/2}y = \tilde{a} + \tilde{n},$$

where $\tilde{a} := R_w^{-1/2}a$ and $\tilde{n} := R_w^{-1/2}w$. Since $\tilde{n}$ is white noise, and since $\tilde{a} = \tilde{A}u$, where $\tilde{A} := R_w^{-1/2}A$, we see that $\tilde{y} = \tilde{A}u + \tilde{n}$ is exactly case 1 with no interference.

It remains to check a few details such as the existence of $R_w^{-1/2}$ and the fact that $\tilde{n}$ is white noise.

To determine $R_w^{-1/2}$, we proceed as follows. Assuming that

$$\int_0^T\int_0^T |r_b(t,\tau)|^2\,dt\,d\tau < \infty,$$

it follows that $R_b$ is a compact operator [2, pp. 86-87].³ By the spectral theorem for compact, self-adjoint operators [2, p. 113], $R_b$ has the representation

$$R_b x = \sum_{k=1}^{\infty}\beta_k\langle x, s_k\rangle s_k,$$

where the $\beta_k$ are positive, nonincreasing eigenvalues with corresponding orthonormal eigenvectors $s_k$; i.e., $R_b s_k = \beta_k s_k$. Furthermore, we have the orthogonal direct sum decomposition [2, p. 115]

$$L^2[0,T] = \ker R_b \oplus \mathcal{B},$$

where $\mathcal{B}$ was defined above. For $x \in L^2[0,T]$, we can always write $x = \check{x} + \hat{x}$, where $\check{x}$ is the projection of $x$ onto $\ker R_b$, and

$$\hat{x} = \sum_{k=1}^{\infty}\langle x, s_k\rangle s_k$$

is the projection of $x$ onto $\mathcal{B}$. Thus,

$$R_w x = (\sigma^2 I + R_b)x = \sigma^2 x + \sum_{k=1}^{\infty}\beta_k\langle x, s_k\rangle s_k \qquad (2.2)$$

$$\phantom{R_w x} = \sigma^2(\check{x} + \hat{x}) + \sum_{k=1}^{\infty}\beta_k\langle x, s_k\rangle s_k$$

$$\phantom{R_w x} = \sigma^2\check{x} + \sum_{k=1}^{\infty}(\sigma^2+\beta_k)\langle x, s_k\rangle s_k.$$

This suggests that

$$R_w^{1/2}x := \sigma\check{x} + \sum_{k=1}^{\infty}(\sigma^2+\beta_k)^{1/2}\langle x, s_k\rangle s_k \qquad (2.3)$$

should be self-adjoint and satisfy $R_w^{1/2}(R_w^{1/2}x) = R_w x$. This is easily checked to be the case. Next, it is an easy exercise to check that

$$R_w^{-1}y = \sigma^{-2}\check{y} + \sum_{k=1}^{\infty}(\sigma^2+\beta_k)^{-1}\langle y, s_k\rangle s_k.$$

Then, just as (2.2) led to (2.3), we easily find that

$$R_w^{-1/2}y = \sigma^{-1}\check{y} + \sum_{k=1}^{\infty}(\sigma^2+\beta_k)^{-1/2}\langle y, s_k\rangle s_k, \qquad (2.4)$$

where $\check{y}$ is the projection of $y$ onto $\ker R_b$.

³ To guarantee the existence of the Karhunen-Loeve expansion, we need $r_b$ to be continuous. Note that continuity of $r_b$ is enough to make $R_b$ compact. Continuity of $r_b$ is assured if we assume $b(t)$ is mean-square continuous.

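It remains to show that $\tilde{n} := R_w^{-1/2}w$ is white noise. Write

$$\tilde{n}(t) = \sigma^{-1}\check{w}(t) + \sum_{k=1}^{\infty}(\sigma^2+\beta_k)^{-1/2}\langle w, s_k\rangle s_k(t)$$

$$\phantom{\tilde{n}(t)} = \sigma^{-1}\check{n}(t) + \sum_{k=1}^{\infty}(\sigma^2+\beta_k)^{-1/2}(B_k+N_k)\,s_k(t),$$

where $B_k$ was defined in (2.1), $N_k$ is defined similarly, and

$$\check{n}(t) = n(t) - \sum_{k=1}^{\infty} N_k s_k(t).$$

Using the formula

$$E[\check{n}(t_1)\check{n}(t_2)^*] = \sigma^2\delta(t_1-t_2) - \sigma^2\sum_{k=1}^{\infty} s_k(t_1)\,s_k(t_2)^*$$

along with $E[B_kN_\ell^*] = 0$, $E[B_kB_\ell^*] = \beta_k\delta_{k\ell}$, and $E[N_kN_\ell^*] = \sigma^2\delta_{k\ell}$, it is straightforward to verify that $E[\tilde{n}(t_1)\tilde{n}(t_2)^*] = \delta(t_1-t_2)$. Hence, $\tilde{n}$ is white noise.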
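A finite-dimensional analogue makes the whitening step easy to check numerically. The following minimal sketch assumes vectors in $\mathbb{C}^n$ in place of waveforms and a rank-$r$ interference covariance $R_b = \sum_k \beta_k s_k s_k^H$ with orthonormal $s_k$; the function name and the numbers are illustrative. It builds $R_w^{-1/2}$ term by term as in (2.4) and confirms that it whitens $R_w = \sigma^2 I + R_b$.

```python
import numpy as np

def inverse_sqrt_covariance(eigvecs, eigvals, sigma):
    """R_w^{-1/2} for R_w = sigma^2 I + S diag(beta) S^H, following (2.4).

    eigvecs: (n, r) matrix S with orthonormal columns s_k.
    eigvals: length-r array of positive eigenvalues beta_k.
    """
    n = eigvecs.shape[0]
    proj_range = eigvecs @ eigvecs.conj().T   # projector onto span{s_k}
    proj_kernel = np.eye(n) - proj_range      # projector onto ker R_b
    scaled = eigvecs @ np.diag((sigma**2 + eigvals) ** -0.5) @ eigvecs.conj().T
    return proj_kernel / sigma + scaled

rng = np.random.default_rng(0)
n, r, sigma = 8, 3, 0.7
S, _ = np.linalg.qr(rng.standard_normal((n, r)) + 1j * rng.standard_normal((n, r)))
beta = np.array([3.0, 2.0, 0.5])
R_w = sigma**2 * np.eye(n) + S @ np.diag(beta) @ S.conj().T
W_half_inv = inverse_sqrt_covariance(S, beta, sigma)
# Covariance of the whitened noise: should be the identity.
print(np.linalg.norm(W_half_inv @ R_w @ W_half_inv.conj().T - np.eye(n)))
```

The kernel term scaled by $1/\sigma$ is what handles the directions in which there is no interference; dropping it would discard those components rather than whiten them.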
2.3. Reduction of Case 3 to Case 4

Let $P_{\mathcal{B}}a$ denote the projection of $a$ onto $\mathcal{B}$, and let $P_{\mathcal{B}}^{\perp}a$ denote the projection of $a$ onto $\mathcal{B}^{\perp}$. Then $y = a + b + n$ can be written as

$$y = P_{\mathcal{B}}^{\perp}a + \tilde{b} + n,$$

where $\tilde{b} := P_{\mathcal{B}}a + b$. In case 3, $b$ is an unknown element of $\mathcal{B}$, and so our model is equivalent to

$$y = P_{\mathcal{B}}^{\perp}a + b + n,$$

where $b$ is just another unknown element of $\mathcal{B}$. Put $\mathcal{G} := P_{\mathcal{B}}^{\perp}(\mathcal{A})$. Since $\mathcal{G} \subset \mathcal{B}^{\perp}$, $P_{\mathcal{G}}b = 0$. Since $a \in \mathcal{A}$, $P_{\mathcal{B}}^{\perp}a \in \mathcal{G}$, and so $P_{\mathcal{G}}P_{\mathcal{B}}^{\perp}a = P_{\mathcal{B}}^{\perp}a$. Hence,

$$z := P_{\mathcal{G}}y = P_{\mathcal{B}}^{\perp}a + v,$$

where $v := P_{\mathcal{G}}n$. Notice that $z$ and $y - z = b + (n - v)$ are uncorrelated and therefore independent. Also, $y - z$ is independent of $a$ and of $v$. Hence, there is no loss of information about $a$ if we work with $z$ instead of $y$. To conclude, note that since $\mathcal{A}$ is finite dimensional, so is $\mathcal{G}$. Hence, instead of working with $z$, we can work with its coordinate vector relative to some orthonormal basis of $\mathcal{G}$. Denote this coordinate vector by $\underline{z}$ and similarly for $\underline{v}$. Since the covariance matrix of $\underline{v}$ is $\sigma^2 I$, $\underline{z}$ is a version of case 4.

Remark. If $n$ is Gaussian, and if whenever $a$ and/or $b$ are random they are also Gaussian, then since all transformations here are linear, Gaussianity is preserved.
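The two projection facts used above, $P_{\mathcal{G}}b = 0$ and $P_{\mathcal{G}}P_{\mathcal{B}}^{\perp}a = P_{\mathcal{B}}^{\perp}a$, can be checked in a finite-dimensional setting. This is a minimal sketch assuming real vectors in $\mathbb{R}^n$ and a finite-dimensional interference subspace, which is a simplification of the infinite-dimensional case treated in the text; the names are illustrative.

```python
import numpy as np

def orth_projector(mat):
    """Orthogonal projector onto the column space of mat (full column rank assumed)."""
    q, _ = np.linalg.qr(mat)
    return q @ q.conj().T

def reduce_case3(y, signal_basis, interference_basis):
    """Return z = P_G y, where G is the image of span(signal_basis) under P_B-perp."""
    n = signal_basis.shape[0]
    p_perp = np.eye(n) - orth_projector(interference_basis)
    p_g = orth_projector(p_perp @ signal_basis)
    return p_g @ y, p_g, p_perp

rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, 2))   # columns span the signal subspace
B = rng.standard_normal((n, 4))   # columns span the interference subspace
a = A @ rng.standard_normal(2)
b = B @ rng.standard_normal(4)
noise = 0.1 * rng.standard_normal(n)
z, P_G, P_perp = reduce_case3(a + b + noise, A, B)
# z should equal P_perp a + P_G n: the interference b has been removed.
print(np.linalg.norm(z - (P_perp @ a + P_G @ noise)))
```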
3. Blind Identification of the Interference-Free Signal Subspace
and the Signal Coefficient Covariance

Consider a signal detection problem in which the received waveform is

$$y(t) = a(t) + b(t) + n(t), \qquad 0 \le t \le T,$$

where $a$ is the random signal to be detected, $b$ is a random interference process, and $n$ is a white noise process. If $a$ belongs to a subspace $\mathcal{A}$, and $b$ belongs to a subspace $\mathcal{B}$, then a key step in designing suboptimal linear detectors for CDMA systems [12], and more generally, matched subspace detectors [8], [9], is the characterization of the subspace $P_{\mathcal{B}}^{\perp}(\mathcal{A})$ that results from projecting the elements of $\mathcal{A}$ onto the orthogonal complement of $\mathcal{B}$. The problem of finding $P_{\mathcal{B}}^{\perp}(\mathcal{A})$ when $\mathcal{B}$ is unknown was studied by Scharf and McCloud [10] when $a$, $b$, and $n$ were finite-dimensional random vectors. Here we allow $a$, $b$, and $n$ to be waveforms, and we specifically allow $\mathcal{B}$ to be infinite dimensional.

3.1.  System  Model 

We assume that the signal is of the form

$$a(t) = (Au)(t) := \sum_{k=1}^{p} u_k a_k(t), \qquad (3.1)$$

where $a_1, \dots, a_p$ are linearly independent waveforms in $L^2[0,T]$, and $u := [u_1, \dots, u_p]'$ is a random vector in $\mathbb{C}^p$. In other words, the operator $A$ takes the random column vector $u$ and returns the random waveform $Au$, which lives in the $p$-dimensional subspace of waveforms

$$\mathcal{A} := \operatorname{span}\{a_1, \dots, a_p\}.$$

For the random subspace signal $a(t)$, its covariance function is easily seen to be

$$r_a(t_1,t_2) := E[a(t_1)a(t_2)^*] = \sum_{k=1}^{p}\sum_{\ell=1}^{p} a_k(t_1)\,(R_u)_{k\ell}\,a_\ell(t_2)^*,$$

where $R_u$ is the covariance matrix of the random vector $u$. The covariance operator corresponding to the covariance function $r_a$ is

$$(R_a x)(t) := \int_0^T r_a(t,\tau)\,x(\tau)\,d\tau, \qquad 0 \le t \le T.$$

A simple calculation shows that

$$R_a x = A R_u A^* x, \qquad (3.2)$$

where the adjoint operator $A^*\colon L^2[0,T] \to \mathbb{C}^p$ is given by

$$A^*x = \begin{bmatrix} \langle x, a_1\rangle \\ \vdots \\ \langle x, a_p\rangle \end{bmatrix},$$

and $\langle\cdot\,,\cdot\rangle$ denotes the standard inner product on $L^2[0,T]$,

$$\langle x, a\rangle := \int_0^T x(t)\,a(t)^*\,dt.$$

We assume that $R_u$ is invertible.

Suppose that the interference $b$ is a second-order process with covariance function $r_b(t_1,t_2) := E[b(t_1)b(t_2)^*]$ and corresponding covariance operator

$$(R_b x)(t) := \int_0^T r_b(t,\tau)\,x(\tau)\,d\tau, \qquad 0 \le t \le T.$$

It is easy to see that $R_b$ is self adjoint. Next, if

$$\int_0^T\!\!\int_0^T |r_b(t,\tau)|^2\,dt\,d\tau < \infty,$$

then $R_b$ is compact [2, pp. 86-87]. By the spectral theorem for compact, self-adjoint operators [2, p. 113], $R_b$ has the representation

$$R_b x = \sum_{k=1}^{\infty} \beta_k \langle x, s_k\rangle s_k,$$

where the $\beta_k$ are positive, nonincreasing eigenvalues with corresponding orthonormal eigenvectors $s_k$; i.e., $R_b s_k = \beta_k s_k$. Furthermore, we have the orthogonal direct sum decomposition [2, p. 115]

$$L^2[0,T] = \ker R_b \oplus \mathcal{B}, \qquad (3.3)$$

where

$$\mathcal{B} := \overline{\operatorname{span}}\{s_k\},$$

and the overbar denotes the closure.

If $b$ is in fact zero-mean, mean-square continuous, then $b$ has the Karhunen-Loeve expansion

$$b(t) = \sum_{k=1}^{\infty} B_k s_k(t),$$

where

$$B_k = \int_0^T b(t)\,s_k(t)^*\,dt.$$

However, this is more than we need. All we really need is (3.3).

3.2. Identification of $P_{\mathcal{B}}^{\perp}(\mathcal{A})$

An important quantity in the design of matched subspace detectors is the subspace

$$\mathcal{Q} := P_{\mathcal{B}}^{\perp}(\mathcal{A}) = \operatorname{span}\{P_{\mathcal{B}}^{\perp}a_1, \dots, P_{\mathcal{B}}^{\perp}a_p\}, \qquad (3.4)$$

where $P_{\mathcal{B}}^{\perp}$ is the projection onto the orthogonal complement of $\mathcal{B}$. We assume $\mathcal{A} \cap \mathcal{B} = \{0\}$ (the zero subspace); this condition is necessary and sufficient to guarantee that the $P_{\mathcal{B}}^{\perp}a_k$ are linearly independent, and hence that $\dim\mathcal{Q} = p$. How can we find $\mathcal{Q}$ if we do not know $\mathcal{B}$?

Let us suppose that we can estimate the covariance function $r_y(t_1,t_2)$ from observed data. Assume that we know the $a_k$, $\sigma^2$, and that $a$, $b$, and $n$ are uncorrelated. Then

$$R_y = R_a + R_b + \sigma^2 I.$$

Put

$$R := R_a + R_b = R_y - \sigma^2 I.$$

Then we know $R$.


Theorem 3.1. The operator $R\colon L^2[0,T] \to L^2[0,T]$ maps the subspace $\mathcal{Q}$ one-to-one and onto the subspace $\mathcal{A}$. Furthermore, $\mathcal{Q} = R^+(\mathcal{A})$, where $R^+$ denotes the pseudoinverse of $R$ (see footnote 4). It then follows that

$$\mathcal{Q} = \operatorname{span}\{R^+a_1, \dots, R^+a_p\}.$$

Remark. The point here is that by knowing the covariance function $r_y(t,\tau)$, the noise variance $\sigma^2$, and the basis waveforms $a_k$ in (3.1), we can obtain a basis for $\mathcal{Q}$. Having a basis for $\mathcal{Q}$, we can project any waveform onto it by solving the appropriate matrix-column-vector normal equations. It is not necessary to know $\mathcal{B}$.

Proof of Theorem 3.1. We first show that for $g \in \mathcal{Q}$, $Rg \in \mathcal{A}$. Since $Rg = R_a g + R_b g$, and since for all $x$, $R_a x \in \mathcal{A}$, it suffices to show that $R_b g = 0$. Now, $g \in \mathcal{Q} \subset \mathcal{B}^{\perp}$. It follows from the orthogonal decomposition (3.3) that $\mathcal{B}^{\perp} = \ker R_b$; thus, for $g \in \mathcal{Q}$, $R_b g = 0$. Hence, $Rg = R_a g \in \mathcal{A}$ as claimed.

Since $\dim\mathcal{Q} = \dim\mathcal{A} < \infty$, if we can prove $R$ is one-to-one on $\mathcal{Q}$, then by [4, p. 81, Th. 9], it follows that $R$ maps $\mathcal{Q}$ onto $\mathcal{A}$. We proceed as follows to show $R$ is one-to-one on $\mathcal{Q}$. For $u, v \in \mathbb{C}^p$, put $g_1 = P_{\mathcal{B}}^{\perp}Au$ and $g_2 = P_{\mathcal{B}}^{\perp}Av$, and suppose $Rg_1 = Rg_2$. As just noted, this implies $R_a g_1 = R_a g_2$. Using (3.2), we have

$$A R_u (A^* P_{\mathcal{B}}^{\perp} A)u = A R_u (A^* P_{\mathcal{B}}^{\perp} A)v.$$

Since $A$ is one-to-one and $R_u$ is assumed invertible, we have

$$(A^* P_{\mathcal{B}}^{\perp} A)u = (A^* P_{\mathcal{B}}^{\perp} A)v.$$

Since $A^* P_{\mathcal{B}}^{\perp} A = (P_{\mathcal{B}}^{\perp}A)^*(P_{\mathcal{B}}^{\perp}A)$, and since the $P_{\mathcal{B}}^{\perp}a_k$ are linearly independent, $P_{\mathcal{B}}^{\perp}A$ is one-to-one, and then so is $(P_{\mathcal{B}}^{\perp}A)^*(P_{\mathcal{B}}^{\perp}A)$ (see footnote 5). Thus, $u = v$ and then $g_1 = g_2$.

Since $R$ maps $\mathcal{Q}$ one-to-one and onto $\mathcal{A}$, there is an inverse map from $\mathcal{A}$ onto $\mathcal{Q}$; i.e., for every $a \in \mathcal{A}$ there is exactly one $g \in \mathcal{Q}$ such that $Rg = a$. If $g \in (\ker R)^{\perp}$, then $g = R^+a$. We now show that if $g = P_{\mathcal{B}}^{\perp}Au$ for some $u$, then $g$ is orthogonal to $\ker R$.

Fix any $x \in \ker R$ so that $Rx = 0$. Then $R_a x = -R_b x$ is an element of $\mathcal{A} \cap \mathcal{B} = \{0\}$. Thus, $R_a x = R_b x = 0$. In particular, $R_b x = 0$ implies $x \in \mathcal{B}^{\perp}$. Now write

$$\langle g, x\rangle = \langle P_{\mathcal{B}}^{\perp}Au, x\rangle = \langle Au, P_{\mathcal{B}}^{\perp}x\rangle = \langle Au, x\rangle = \langle u, A^*x\rangle,$$

which equals zero because $0 = R_a x = A R_u A^* x$ implies $A^*x = 0$ on account of the linear independence of the $a_k$ and the invertibility of $R_u$. Thus, $\mathcal{Q} \subset (\ker R)^{\perp}$. □

Footnote 4. For a bounded operator $R$, the domain of $R^+$ is the set of all $y \in L^2[0,T]$ such that the projection of $y$ onto the range of $R$ exists. Denoting this projection by $\hat{y}$, $R^+y$ is defined to be the minimum-norm solution of $Rx = \hat{y}$. By definition of $\hat{y}$, it is in the range of $R$, and so there is at least one solution $x_0$ such that $Rx_0 = \hat{y}$. Since $R$ is a bounded operator, $\ker R$ is closed. Since $L^2[0,T]$ is complete, the projection theorem [5, §3.3] shows that $L^2[0,T] = \ker R \oplus (\ker R)^{\perp}$. It is then easily seen that the projection of $x_0$ onto $(\ker R)^{\perp}$ is the unique, minimum-norm solution. We shall be concerned with the case in which $y$ is already in the range of $R$ so that $\hat{y} = y$. For such $y$, $R^+y$ is defined.

Footnote 5. Use the easily verified fact that for any operator $D$ with adjoint $D^*$, $\ker D = \ker D^*D$.

We  now  turn  to  the  problem  of  finding  Ru. 

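Theorem 3.2. The covariance matrix $R_u$ can be obtained via

$$R_u = (A^*R^+A)^{-1}.$$

Proof. It suffices to show that $A^*R^+AR_u = I$, which we do by showing that $A^*R^+AR_uv = v$ for arbitrary $v \in \mathbb{C}^p$. Given $v \in \mathbb{C}^p$, $AR_uv \in \mathcal{A}$. By Theorem 3.1, $R^+AR_uv \in \mathcal{Q}$. Hence,

$$R^+AR_uv = P_{\mathcal{B}}^{\perp}Au, \quad \text{for some } u \in \mathbb{C}^p. \qquad (3.5)$$

From the definition of pseudoinverse, $RP_{\mathcal{B}}^{\perp}Au$ is equal to the projection of $AR_uv$ onto the range of $R$. Since $AR_uv \in \mathcal{A}$, which, by Theorem 3.1, is a subset of the range of $R$, $RP_{\mathcal{B}}^{\perp}Au = AR_uv$. Now write

$$AR_uv = RP_{\mathcal{B}}^{\perp}Au = R_aP_{\mathcal{B}}^{\perp}Au = AR_uA^*P_{\mathcal{B}}^{\perp}Au.$$

Since $A$ is one-to-one, and since $R_u$ is assumed invertible, $v = A^*P_{\mathcal{B}}^{\perp}Au$. Using (3.5), write

$$A^*R^+AR_uv = A^*P_{\mathcal{B}}^{\perp}Au = v. \qquad \Box$$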
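A small finite-dimensional check can make Theorems 3.1 and 3.2 concrete. The sketch below is not part of the original analysis: it replaces $L^2[0,T]$ by $\mathbb{R}^3$, takes $p = 1$, and uses made-up values for the signal vector `a1`, the interference eigenvector `s1`, the coefficient variance `r_u`, and the eigenvalue `beta1`. It computes $R^+a_1$ as the minimum-norm solution of $Rx = a_1$ within the range of $R$, then compares it with $P_{\mathcal{B}}^{\perp}a_1$ and recovers $R_u$ from $(A^*R^+A)^{-1}$.

```python
# Finite-dimensional sketch of Theorems 3.1 and 3.2 with p = 1 in R^3.
# All numerical values are illustrative assumptions, not from the report.

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def matvec(M, x):
    return [dot(row, x) for row in M]

def outer(x, y, scale=1.0):
    return [[scale * xi * yj for yj in y] for xi in x]

a1 = [1.0, 1.0, 0.0]      # signal vector (spans the subspace A)
s1 = [1.0, 0.0, 0.0]      # unit-norm interference eigenvector (spans B)
r_u = 2.0                 # variance of the scalar coefficient u
beta1 = 3.0               # interference eigenvalue

Ra = outer(a1, a1, r_u)           # R_a = A R_u A^*
Rb = outer(s1, s1, beta1)         # R_b = beta1 * s1 s1^T
R = [[Ra[i][j] + Rb[i][j] for j in range(3)] for i in range(3)]

# R is symmetric with range span{a1, s1}. The minimum-norm solution of
# R x = a1 lies in that range, so write x = c1*a1 + c2*s1 and solve the
# 2x2 system obtained by taking inner products of R x = a1 with a1 and s1.
Ra1, Rs1 = matvec(R, a1), matvec(R, s1)
m11, m12 = dot(a1, Ra1), dot(a1, Rs1)
m21, m22 = dot(s1, Ra1), dot(s1, Rs1)
b1, b2 = dot(a1, a1), dot(s1, a1)
det = m11 * m22 - m12 * m21
c1 = (b1 * m22 - m12 * b2) / det
c2 = (m11 * b2 - m21 * b1) / det
R_pinv_a1 = [c1 * x + c2 * y for x, y in zip(a1, s1)]   # R^+ a1

# Projection of a1 onto the orthogonal complement of B. This uses s1 and
# is shown only for comparison with the blind result above.
proj = dot(a1, s1)
P_perp_a1 = [x - proj * y for x, y in zip(a1, s1)]

r_u_recovered = 1.0 / dot(a1, R_pinv_a1)    # (A^* R^+ A)^{-1} for p = 1
```

In this toy case `R_pinv_a1` is a scalar multiple of `P_perp_a1`, so both span the same one-dimensional $\mathcal{Q}$, and `r_u_recovered` matches `r_u` without any use of `s1` in the blind computation.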
4. Waveform Sets with Maximum Mutual Information

Consider the received waveform

$$y(t) = (Au)(t) + n(t),$$

where

$$(Au)(t) := \sum_{k=1}^{p} u_k a_k(t). \qquad (4.1)$$

We assume that the waveforms $a_k$ are linearly independent and that $u := [u_1, \dots, u_p]'$ is $N(0, R_{uu})$ ($R_{uu}$ positive definite) and independent of the zero-mean, white Gaussian noise $n(t)$ with power spectral density $S_n(f) = \sigma^2$. Our goal is to choose the waveforms $a_k$ so as to maximize the average mutual information between $u$ and $y$, denoted by $I(u \wedge y)$. The idea is to choose waveforms $a_k$ so that the received waveform $y$ provides as much information as possible about the weights $u_k$.

Direct evaluation of $I(u \wedge y)$ is complicated by the fact that $u$ is a column vector and $y$ is a continuous-time random process. In Section 4.1, we show that $I(u \wedge y) = I(u \wedge v)$ where $v$ is a column vector related to $y$. It will be seen that $u$ and $v$ are jointly Gaussian. Hence, the joint distribution of $u$ and $v$ is completely determined by the covariance matrices $R_{uu}$ and $R_{vv}$ and the cross-covariance matrix $R_{uv}$. In Section 4.2, $R_{uv}$ and $R_{vv}$ are obtained. Then $I(u \wedge v)$ is expressed in terms of known quantities in Section 4.3. Finally, in Section 4.4, the optimization problem is formulated and solved in closed form.

4.1.  Reduction  to  a  Finite-Dimensional  Problem 

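Let $\mathcal{A} := \operatorname{span}\{a_1, \dots, a_p\}$, and let $P_{\mathcal{A}}$ denote the projection operator onto $\mathcal{A}$. Since $y = Au + n$, where $Au \in \mathcal{A}$, we have

$$\hat{y} := P_{\mathcal{A}}y = Au + P_{\mathcal{A}}n$$

and

$$\tilde{y} := y - \hat{y} = n - P_{\mathcal{A}}n = P_{\mathcal{A}}^{\perp}n,$$

where $P_{\mathcal{A}}^{\perp}$ denotes the projection onto $\mathcal{A}^{\perp}$, the orthogonal complement of $\mathcal{A}$. Since $\hat{y}$ and $\tilde{y}$ are uncorrelated and jointly Gaussian, they are independent.

Using the fact that the mapping $y \mapsto (\hat{y}, \tilde{y})$ is invertible, along with a standard identity, we have

$$I(u \wedge y) = I(u \wedge (\hat{y}, \tilde{y})) = I(u \wedge \hat{y}) + I(u \wedge \tilde{y} \mid \hat{y}).$$

To see that this last term is zero, observe that

$$I(u \wedge \tilde{y} \mid \hat{y}) = H(\tilde{y} \mid \hat{y}) - H(\tilde{y} \mid \hat{y}, u) = H(\tilde{y}) - H(\tilde{y}) = 0,$$

where the second equality follows by independence. Thus,

$$I(u \wedge y) = I(u \wedge \hat{y}). \qquad (4.2)$$

We next show that the waveform $\hat{y}$ is equivalent to the column vector

$$A^*y = \begin{bmatrix} \langle y, a_1\rangle \\ \vdots \\ \langle y, a_p\rangle \end{bmatrix},$$

where $\langle\cdot\,,\cdot\rangle$ denotes the inner product,

$$\langle y, a\rangle := \int y(t)\,a(t)^*\,dt.$$

By equivalent we mean that $\hat{y}$ can be obtained as a function of $A^*y$ and $A^*y$ can be obtained as a function of $\hat{y}$. To see this, first recall that $P_{\mathcal{A}} = A(A^*A)^{-1}A^*$ [5, pp. 160-161]. Thus, $\hat{y} = P_{\mathcal{A}}y$ is a function of $A^*y$. Conversely,

$$A^*\hat{y} = A^*A(A^*A)^{-1}A^*y = A^*y.$$

Since $\hat{y}$ and $A^*y$ are equivalent,

$$I(u \wedge \hat{y}) = I(u \wedge A^*y). \qquad (4.3)$$

To conclude this section, we define $v$ in terms of $A^*y$ by the invertible transformation

$$v := (A^*A)^{-1/2}A^*y.$$

Hence, $I(u \wedge A^*y) = I(u \wedge v)$. Putting this together with (4.2) and (4.3), we have $I(u \wedge y) = I(u \wedge v)$ as required.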

4.2.  The  Joint  Distribution  of  u  and  v 

Observe  that 

$$v := (A^*A)^{-1/2}A^*y = (A^*A)^{1/2}u + w,$$

where $w := (A^*A)^{-1/2}A^*n$ is a $p$-dimensional $N(0, \sigma^2 I)$ random vector. It now follows that $u$ and $v$ are jointly Gaussian with

$$R_{uv} = R_{uu}(A^*A)^{1/2} \qquad (4.4)$$

and

$$R_{vv} = (A^*A)^{1/2}R_{uu}(A^*A)^{1/2} + \sigma^2 I = (A^*A)^{1/2}R_{uu}^{1/2}\,[I + S^{-1}]\,R_{uu}^{1/2}(A^*A)^{1/2}, \qquad (4.5)$$

where $S$ is the SNR matrix,

$$S := \frac{R_{uu}^{1/2}(A^*A)R_{uu}^{1/2}}{\sigma^2}. \qquad (4.6)$$

4.3. Evaluation of the Average Mutual Information

An easy calculation, e.g., [11, Section V], shows that the average mutual information between $u$ and $v$ is

$$I(u \wedge v) = \tfrac{1}{2}\,[\ln\det R_{vv} - \ln\det Q_{vv}],$$

where [11, eq. (20)]

$$Q_{vv} := R_{vv} - R_{vu}R_{uu}^{-1}R_{uv} = R_{vv}^{1/2}(I - C'C)R_{vv}^{1/2},$$

and

$$C := R_{uu}^{-1/2}R_{uv}R_{vv}^{-1/2} \qquad (4.7)$$

is the coherence matrix. It follows that

$$I(u \wedge v) = -\tfrac{1}{2}\ln\det(I - C'C) = -\tfrac{1}{2}\ln\det(I - CC').$$

Using (4.7) along with (4.4) and (4.5), it is easy to check that $CC' = (I + S^{-1})^{-1}$. Using the matrix inversion formula [8],

$$I - CC' = I - [I - (I + S)^{-1}] = (I + S)^{-1}.$$

Thus,

$$I(u \wedge v) = \tfrac{1}{2}\ln\det(I + S).$$
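As a quick numerical illustration of the last two steps, the sketch below specializes (4.4) through (4.7) to the scalar case $p = 1$ and confirms that the coherence form and the SNR form of the mutual information agree. The values of `r_uu`, `energy`, and `sigma2` are arbitrary assumptions chosen only for the check.

```python
import math

# Scalar (p = 1) check that the coherence form and the SNR form of the
# mutual information agree. Numerical values are illustrative assumptions.
r_uu = 1.5        # variance of u
energy = 4.0      # A^*A = ||a_1||^2
sigma2 = 0.5      # noise level sigma^2

r_uv = r_uu * math.sqrt(energy)              # scalar version of (4.4)
r_vv = energy * r_uu + sigma2                # scalar version of (4.5)
snr = r_uu * energy / sigma2                 # scalar version of (4.6)

coherence_sq = r_uv ** 2 / (r_uu * r_vv)     # C C' built from (4.7)
mi_from_coherence = -0.5 * math.log(1.0 - coherence_sq)
mi_from_snr = 0.5 * math.log(1.0 + snr)
```

Both expressions give the same number, and `coherence_sq` equals $(1 + S^{-1})^{-1}$ as claimed in the text.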


4.4.  The  Optimization  Problem 

It is instructive to consider the case in which $p = 1$ in (4.1). Then the SNR matrix $S$ in (4.6) is the scalar SNR,

$$S = \frac{R_{uu}(A^*A)}{\sigma^2}.$$

It is then clear that $I(u \wedge v)$ is monotonic increasing with $A^*A = \|a_1\|^2$. Hence, in order to have $\max_A I(u \wedge v)$ finite, we must impose some kind of constraint on the energy of the signaling waveforms.

Let

$$L := R_{uu}^{1/2}(A^*A)R_{uu}^{1/2}$$

denote the numerator in (4.6) so that $S = L/\sigma^2$. Consider the problem

$$\max\ \tfrac{1}{2}\ln\det(I + L/\sigma^2) \quad \text{subject to} \quad \operatorname{tr} L \le \mathcal{E}, \qquad (4.8)$$

where $\mathcal{E}$ is a constraint on the allowable energy of the signaling waveforms. Now observe that for any orthogonal matrix $P$ ($P'P = PP' = I$), (4.8) is equivalent to

$$\max\ \tfrac{1}{2}\ln\det(I + P'LP/\sigma^2) \quad \text{subject to} \quad \operatorname{tr}(P'LP) \le \mathcal{E}.$$

Hence, if $P$ is chosen to diagonalize $L$, then

$$P'LP = \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p),$$

and (4.8) becomes

$$\max_{\Lambda}\ \frac{1}{2}\sum_{k=1}^{p}\ln(1 + \lambda_k/\sigma^2) \quad \text{subject to} \quad \sum_{k=1}^{p}\lambda_k \le \mathcal{E},$$

where the $\lambda_k$ are the eigenvalues of $L$. Using Lagrange multipliers, it is an easy exercise to show that the optimal choices for the $\lambda_k$ are $\lambda_k = \mathcal{E}/p$ for all $k$ and that $I(u \wedge v) = (p/2)\ln(1 + \mathcal{E}/p\sigma^2)$. It is easy to see that if $A^*A = (\mathcal{E}/p)R_{uu}^{-1}$, then $L = R_{uu}^{1/2}(A^*A)R_{uu}^{1/2} = (\mathcal{E}/p)I$ is diagonal with all eigenvalues equal to $\mathcal{E}/p$. To conclude, we observe that if $\tilde{a}_1, \dots, \tilde{a}_p$ is any orthonormal basis of waveforms with corresponding operator $\tilde{A}$ given as in (4.1), then $A := \sqrt{\mathcal{E}/p}\,\tilde{A}R_{uu}^{-1/2}$ solves the problem. In other words, we should take

$$a_k(t) = \sqrt{\frac{\mathcal{E}}{p}}\,\sum_{i=1}^{p}\bigl(R_{uu}^{-1/2}\bigr)_{ik}\,\tilde{a}_i(t).$$
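The equal-eigenvalue solution of the diagonalized problem can be checked numerically. The sketch below is an illustration under assumed values of `p`, `energy` (standing in for $\mathcal{E}$), and `sigma2`: it evaluates the objective at $\lambda_k = \mathcal{E}/p$, compares it with the closed form $(p/2)\ln(1 + \mathcal{E}/p\sigma^2)$, and compares it against randomly drawn allocations that spend the same total energy.

```python
import math
import random

def mutual_information(eigenvalues, sigma2):
    """(1/2) * sum_k ln(1 + lambda_k / sigma^2) for a diagonalized L."""
    return 0.5 * sum(math.log(1.0 + lam / sigma2) for lam in eigenvalues)

def random_allocation(p, energy, rng):
    """Random nonnegative eigenvalues that use the full energy budget."""
    weights = [rng.random() for _ in range(p)]
    total = sum(weights)
    return [energy * w / total for w in weights]

p, energy, sigma2 = 4, 10.0, 1.0      # illustrative values
equal = [energy / p] * p
best_equal = mutual_information(equal, sigma2)
closed_form = (p / 2.0) * math.log(1.0 + energy / (p * sigma2))

rng = random.Random(0)
best_random = max(
    mutual_information(random_allocation(p, energy, rng), sigma2)
    for _ in range(1000)
)
```

Because the logarithm is concave, no allocation with the same trace can exceed the equal split, which is what the random search reflects.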


Remark. In general, for any operator $\tilde{A}$ defined similarly to (4.1), if we take $A := \tilde{A}R_{uu}^{-1/2}$, then

$$y = Au + n = \tilde{A}R_{uu}^{-1/2}u + n.$$

If we put $\tilde{u} := R_{uu}^{-1/2}u$, then

$$y = \tilde{A}\tilde{u} + n,$$

where the covariance matrix of $\tilde{u}$ is $I$. In other words, we are whitening the input.

4.5.  Discussion  of  the  Constraint  in  (4.8) 

It is rather obvious that we used the constraint

$$\operatorname{tr} L = \operatorname{tr}\bigl(R_{uu}^{1/2}(A^*A)R_{uu}^{1/2}\bigr) \le \mathcal{E}$$

to make (4.8) easy to solve in closed form. Here is a different approach to the problem that appears to use the more natural constraint

$$\operatorname{tr}(A^*A) = \sum_{k=1}^{p}\|a_k\|^2 \le \mathcal{E}.$$

Rewrite $y = Au + n$ as

$$y = AR_{uu}^{1/2}R_{uu}^{-1/2}u + n.$$

If we put

$$\tilde{A} := AR_{uu}^{1/2} \qquad (4.9)$$

and $\tilde{u} := R_{uu}^{-1/2}u$, then we have the model $y = \tilde{A}\tilde{u} + n$, where now $\tilde{u}$ has covariance matrix $I$. Since $u$ and $\tilde{u}$ are related by an invertible transformation, $I(u \wedge y) = I(\tilde{u} \wedge y)$. We would then say that there is no loss of generality in taking $R_{uu} = I$ and proceed as above and impose the apparently natural constraint $\operatorname{tr}(A^*A) \le \mathcal{E}$. However, this is misleading because the foregoing $A$ is actually $\tilde{A}$ in (4.9). Using (4.9), $\operatorname{tr}(\tilde{A}^*\tilde{A}) = \operatorname{tr}\bigl(R_{uu}^{1/2}(A^*A)R_{uu}^{1/2}\bigr)$.

4.6. Future Work

One obvious extension is to replace the constraint in (4.8) with the more natural constraint

$$\operatorname{tr}(A^*A) = \sum_{k=1}^{p}\|a_k\|^2 \le \mathcal{E}.$$

However, this seems rather challenging.

In a multipath channel, the received waveform would again be $y = Au + n$, but now the $a_k$ would have the form $a_k(t) = a(t - (k-1)T)$, $k = 1, \dots, p$, where $T$ is proportional to the reciprocal of the bandwidth of the basic waveform $a(t)$, and $p$ is proportional to the product of the bandwidth and the channel multipath spread as in Section 1. In this case, the foregoing analysis does not immediately apply since there is now the additional constraint that the $a_k$ be shifts of a basic pulse $a$.
5. Fusion of Decentralized Decisions

5.1. A Preliminary: Detection with an Arbitrary Discrete Random Variable

Let $Z$ be a discrete random variable taking $N$ distinct values with positive probability under each of two hypotheses. Let $\mathcal{Z}$ denote the set of $N$ values taken by $Z$, and let

$$L(z) := \frac{P_1(Z = z)}{P_0(Z = z)} \qquad (5.1)$$

denote the likelihood ratio of $Z$. Without loss of generality, we enumerate the elements of $\mathcal{Z}$, say $z_1, \dots, z_N$, so that

$$L(z_1) \le \cdots \le L(z_N).$$

If $\eta = L(z_k) > L(z_{k-1})$, then the probability of false alarm is

$$p_{fa}(\eta) = P_0(L(Z) \ge \eta) = \sum_{i=k}^{N} P_0(Z = z_i), \qquad (5.2)$$

and the probability of detection is

$$p_D(\eta) = P_1(L(Z) \ge \eta) = \sum_{i=k}^{N} P_1(Z = z_i).$$

Note that $p_{fa}$ and $p_D$ are nonincreasing, left-continuous functions of $\eta$. In fact, these functions are piecewise constant with jumps at the values $\eta = L(z_k)$.
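The construction above translates directly into a short routine. The following sketch assumes the two probability mass functions are given as lists `p0` and `p1` (hypothetical inputs); it sorts the values by likelihood ratio and accumulates the tail sums that define the false-alarm and detection probabilities at each distinct threshold.

```python
def roc_points(p0, p1):
    """Return (threshold, p_fa, p_d) triples for a discrete observation.

    p0[i] and p1[i] are the probabilities of the i-th value under
    hypotheses 0 and 1; all entries are assumed positive.
    """
    # Sort the values by likelihood ratio, as in the enumeration z_1..z_N.
    order = sorted(range(len(p0)), key=lambda i: p1[i] / p0[i])
    ratios = [p1[i] / p0[i] for i in order]
    points = []
    for k, eta in enumerate(ratios):
        if k > 0 and ratios[k] == ratios[k - 1]:
            continue  # only thresholds with L(z_k) > L(z_{k-1}) give new points
        p_fa = sum(p0[i] for i in order[k:])
        p_d = sum(p1[i] for i in order[k:])
        points.append((eta, p_fa, p_d))
    return points

# Hypothetical pmfs on N = 4 values.
p0 = [0.4, 0.3, 0.2, 0.1]
p1 = [0.1, 0.2, 0.3, 0.4]
points = roc_points(p0, p1)
```

Plotting the second and third entries of each triple against each other gives the discrete ROC points; the exact floating-point tie test is adequate for a sketch but would need a tolerance in practice.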

5.2.  Fusion  of  Local  Binary  Decisions 

Consider a collection of $n$ decentralized sensors, each making its own binary decision $D_i = 0$ or $1$ about whether the underlying hypothesis is 0 or 1. At a fusion center, the decentralized decisions are collected into the discrete random variable $Z := [D_1, \dots, D_n]'$ taking $N = 2^n$ distinct values. The centralized decision of the fusion center may then be treated as in Section 5.1.
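To see how the fusion center's discrete variable arises, the sketch below enumerates all $2^n$ decision vectors and computes their likelihood ratios. It assumes, beyond what the text states, that the local decisions are conditionally independent given the hypothesis, with hypothetical local false-alarm probabilities `alphas` and detection probabilities `betas`.

```python
from itertools import product

def fused_likelihood_ratios(alphas, betas):
    """Likelihood ratio of each decision vector Z = [D_1, ..., D_n]'.

    Assumes (this is not stated in the text) that the local decisions are
    conditionally independent, with P0(D_i = 1) = alphas[i] and
    P1(D_i = 1) = betas[i].
    """
    table = {}
    for z in product((0, 1), repeat=len(alphas)):
        prob0 = prob1 = 1.0
        for d, a, b in zip(z, alphas, betas):
            prob0 *= a if d else 1.0 - a
            prob1 *= b if d else 1.0 - b
        table[z] = (prob0, prob1, prob1 / prob0)
    return table

alphas = [0.1, 0.2, 0.3]   # hypothetical local false-alarm probabilities
betas = [0.8, 0.7, 0.6]    # hypothetical local detection probabilities
table = fused_likelihood_ratios(alphas, betas)
ranked = sorted(table, key=lambda z: table[z][2])   # z_1, ..., z_N
```

The list `ranked` is the enumeration $z_1, \dots, z_N$ of Section 5.1, so thresholding on its position reproduces the likelihood-ratio test at the fusion center.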


5.3.  Quantization  for  ROC  Approximation 

Let $Y$ be a random vector taking values in $\mathbb{R}^d$, and let $A_1, \dots, A_N$ be a partition of $\mathbb{R}^d$. Put $Z = z_k$ if $Y \in A_k$ so that $Z$ is a discrete random variable as above. If the sets $A_k$ satisfy $P_0(Y \in A_k) = \Delta_0$ for some small $\Delta_0$, then the ROC curve, which plots $p_D(\eta)$ versus $p_{fa}(\eta)$, will have points closely spaced on the horizontal axis.

As an example, consider a real-valued random variable $Y$ with cumulative distribution function $F_i(y) = P_i(Y \le y)$, $i = 0, 1$. Put $\Delta_0 = 1/N$, and for $k = 1, \dots, N-1$, let $y_k$ solve $F_0(y_k) = k\Delta_0$. Put $A_k := (y_{k-1}, y_k]$, where $y_0 := -\infty$, and $A_N = (y_{N-1}, \infty)$. Then $P_0(Z = z_k) = \Delta_0$. If (5.1) does not hold, it may be necessary to renumber the $z_k$. In any case, the distinct values of $p_{fa}(\eta)$ in (5.2) will be spaced $\Delta_0$ apart.
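The scalar example can be sketched with the standard library's `statistics.NormalDist`. The Gaussian choice for the two hypotheses (mean 0 versus mean 1, unit variance) is an assumption made only for illustration; the text does not specify a distribution.

```python
from statistics import NormalDist

def equiprobable_thresholds(dist0, n_cells):
    """Thresholds y_1 < ... < y_{N-1} with F0(y_k) = k / N."""
    return [dist0.inv_cdf(k / n_cells) for k in range(1, n_cells)]

def cell_probabilities(dist, thresholds):
    """Probability of each cell (y_{k-1}, y_k] under the given distribution."""
    cdf_values = [0.0] + [dist.cdf(y) for y in thresholds] + [1.0]
    return [hi - lo for lo, hi in zip(cdf_values, cdf_values[1:])]

# Hypothetical example: unit-variance Gaussian, mean 0 under H0, 1 under H1.
h0, h1 = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)
N = 8
thresholds = equiprobable_thresholds(h0, N)
q0 = cell_probabilities(h0, thresholds)   # each entry is about 1/N
q1 = cell_probabilities(h1, thresholds)
```

For this particular pair of distributions the likelihood ratio increases with $y$, so the cells are already in likelihood-ratio order and no renumbering is needed.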

5.4.  Quantization  for  Likelihood  Ratio  Approximation 

Let $Y_j$ be a continuous-valued measurement available at sensor $j$, $j = 1, \dots, n$. Let $L(y_1, \dots, y_n)$ denote the likelihood ratio based on the joint distribution of the $Y_j$. Assume each sensor is equipped with a fine partition of intervals. If sensor $j$ observes $Y_j \in (a_{j,k_j}, b_{j,k_j}]$, it transmits only the index $k_j$. Since the fusion center cannot evaluate $L(Y_1, \dots, Y_n)$, it uses the value

$$L\!\left(\frac{a_{1,k_1} + b_{1,k_1}}{2}, \dots, \frac{a_{n,k_n} + b_{n,k_n}}{2}\right).$$
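A minimal sketch of this midpoint rule follows. It assumes independent unit-variance Gaussian measurements with mean 0 under hypothesis 0 and mean 1 under hypothesis 1, and a common uniform partition at every sensor; none of these choices come from the text, and the function names are hypothetical.

```python
import math

def gaussian_shift_lr(ys, mean1=1.0):
    """Joint likelihood ratio for independent N(mean1, 1) vs N(0, 1) samples."""
    return math.exp(sum(mean1 * y - 0.5 * mean1 ** 2 for y in ys))

def quantize(y, edges):
    """Index k of the cell (edges[k], edges[k+1]] containing y."""
    for k in range(len(edges) - 1):
        if edges[k] < y <= edges[k + 1]:
            return k
    raise ValueError("measurement outside the partition")

def fused_lr_from_indices(indices, edges):
    """Evaluate the likelihood ratio at the midpoints of the reported cells."""
    midpoints = [0.5 * (edges[k] + edges[k + 1]) for k in indices]
    return gaussian_shift_lr(midpoints)

edges = [-4.0 + 0.25 * k for k in range(33)]     # cells of width 0.25 on (-4, 4]
measurements = [0.3, -1.1, 2.05]                 # one value per sensor
indices = [quantize(y, edges) for y in measurements]
approx = fused_lr_from_indices(indices, edges)
exact = gaussian_shift_lr(measurements)
```

Each midpoint is within half a cell width of the true measurement, so for this likelihood ratio the log of `approx / exact` is bounded by the number of sensors times half the cell width.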

5.5.  Many  Independent  Sensors 

Let $A_k$ be such that

$$\alpha_k := P_0(Y_k \in A_k) < P_1(Y_k \in A_k) =: \beta_k.$$

Such a test is said to be unbiased. Put

$$X_n := \frac{1}{n}\sum_{k=1}^{n} I_{A_k}(Y_k),$$

where $I_{A_k}$ denotes the indicator function of $A_k$, and observe that

$$E_0[X_n] = \frac{1}{n}\sum_{k=1}^{n}\alpha_k < \frac{1}{n}\sum_{k=1}^{n}\beta_k = E_1[X_n].$$

If for some $\varepsilon > 0$ and some $\eta$, $E_0[X_n] \le \eta - \varepsilon$ and $\eta + \varepsilon \le E_1[X_n]$ for all $n$, then under reasonable conditions a suitable weak law of large numbers will hold so that

$$P_0(X_n > \eta) \to 0 \quad \text{and} \quad P_1(X_n > \eta) \to 1.$$

In other words, asymptotically, as long as each sensor doesn’t do too badly, the fusion center can obtain arbitrarily small probability of false alarm and arbitrarily high probability of detection. Of course, for finite $n$, we want to make $E_0[X_n]$ and $E_1[X_n]$ far apart. For example,

$$E_1[X_n] - E_0[X_n] = \frac{1}{n}\sum_{k=1}^{n}(\beta_k - \alpha_k).$$

If each $\alpha_k$ is fixed, then each $\beta_k$ can be maximized by choosing $A_k$ according to the Neyman-Pearson Lemma.
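The asymptotic claim can be illustrated exactly in the simplest case. The sketch below assumes identical, independent sensors with common values `alpha` and `beta` (an assumption for illustration, so that $nX_n$ is binomial) and evaluates both tail probabilities for increasing $n$.

```python
from math import comb

def tail_probability(n, p, eta):
    """P(X_n > eta) when n * X_n is Binomial(n, p)."""
    return sum(
        comb(n, m) * p ** m * (1.0 - p) ** (n - m)
        for m in range(n + 1)
        if m / n > eta
    )

alpha, beta, eta = 0.3, 0.6, 0.45     # hypothetical values with alpha < eta < beta
for n in (5, 20, 80):
    p_fa = tail_probability(n, alpha, eta)
    p_d = tail_probability(n, beta, eta)
    print(f"n={n:3d}  P0(X_n > eta)={p_fa:.4f}  P1(X_n > eta)={p_d:.4f}")
```

As $n$ grows the first column shrinks toward 0 and the second grows toward 1, in line with the weak-law argument.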
Abstract

The concept of integrating sensing with processing for general communication problems is developed. The fundamental notion of varying basic system parameters (including modulation type, carrier frequency, bit/symbol rate, coding scheme, and so on) in a manner that is dependent on a regular sensing of the operating environment is introduced. This notion is abstracted mathematically, resulting in a concrete model for a time-varying communication system in the presence of a time-varying channel. The goal of the research is to find classes of high-capacity time-varying systems that can react favorably to time-varying channels. Using the developed mathematical abstraction as a framework and this overarching goal as a guiding principle, a sequence of mathematical problems is posed. The practicality of real time-varying systems that are based on the abstraction is discussed in light of modern communication system technologies such as high-speed DSP and software radio.

1 Introduction

This document is a record of MRC’s ISP program effort that relates to communication systems. In particular, a specific notion of the integration of sensing and processing (ISP) for arbitrary communication links is proposed. This notion is set against the backdrop of traditional communication-system design philosophy and is used to develop mathematical abstractions that lead to specific mathematical problems of ISP interest. The posing and study of these problems makes up the ISP work related to communication systems.

The remainder of this document is organized as follows. The traditional approaches to communication-system design are briefly described in Section 2 and ISP approaches are defined in Section 3. The ISP approaches are used to develop overarching mathematical abstractions of communication systems under ISP in Section 4, and specific mathematical problems are then posed in Section 5. Section 6 addresses the gap between the mathematical problems and the engineering design of real systems that are capable of achieving some or all of the ISP gains in throughput, robustness to channel impairments, or error probability. Section 7 provides mathematical problem analysis, Section 8 contains a discussion of results and some numerical examples, and concluding remarks are provided in Section 9. Appendix A contains some relevant information-theoretic definitions and results.


2  Conventional  Communication  System  Design  Philosophy 

Conventional design approaches can be divided into the traditional approach and the modern approaches.

2.1  Traditional  Approach 

In  the  traditional  approach  to  communication-system  design,  modeling  and  measurement  are  used 
to  characterize  (perhaps  incompletely)  the  physical  channel  and  the  source.  Then  each  block  in 
the  basic  block  diagram  in  Figure  1  is  separately  designed  and  optimized.  A  basic  difficulty  with 
this  approach  is  that  it  cannot  adapt  to  deviations  from  the  assumed  channel  model.  That  is,  it  can 
neither  take  advantage  of  unusually  good  conditions  nor  defend  against  bad  conditions  that  render 
the  nominal  system  completely  ineffective  (high  error  rate). 


[Block diagram; recoverable block labels: Source, Channel]

Figure 1: Basic communication-system block diagram.

2.2 Modern Approaches

In more recent approaches [17]-[25], the traditional approach has been modified by allowing one or more of the basic blocks in Figure 1 to possess an adaptive character or by combining two or more of the blocks and treating the result as an integral subsystem to be optimized. Adaptive equalizers are examples of the former, and the most visible example of the latter is trellis-coded modulation, arising from the combination of the channel coder and the modulator, as shown in Figure 2.


[Block diagram; recoverable block labels: Trellis-Coded Modulation, Trellis-Coded Demodulation]

Figure 2: Block diagram for a communication system employing trellis-coded modulation.

Another important example of the more flexible modern approach is multicarrier modulation or OFDM. The signal can be viewed as a set of adjacent-channel digital QAM signals with identical symbol rates. The constellation used in each subcarrier can be varied to adapt to changing channel conditions, resulting in variable bit rate across the subchannels.

Although  the  modern  approaches  do  provide  means  for  adapting  to  changing  channels  (and 
possibly  to  changing  data-rate  and  quality-of-service  demands),  they  do  not  go  as  far  as  possible 
in  integrating  environment  sensing  with  processing.  In  particular,  many  other  parameters  may  be 
adjusted  to  react  to  time-varying  channels,  including  modulation  type  (not  just  the  constellation), 
coding  schemes  and  parameters,  center  frequency,  antenna  pattern,  RF  bandwidth,  spreading  gain, 
etc. 


3 Integrating Sensing and Processing for Communication Systems

To integrate sensing with processing in the communication-system context, two things are required: (1) the ability to sense one or more aspects of the environment, and (2) the ability to alter one or more aspects of the processing in response to sensing. For our purposes, the environment is not limited to the physical channel, but also includes possibly time-varying user demands on data rate and error performance.

We can divide the class of ISP-enabled communication systems into those that require significant cooperation between source and destination and those that do not. The former will be called cooperative ISP systems (CISPS), and the latter autonomous ISP systems (AISPS). In autonomous systems, the destination senses the environment on its own and makes its own choices for operating parameters accordingly. No feedback between destination and source is required, as illustrated in Figure 3. Of course, this severely limits the set of adjustable parameters.

[Block diagram; recoverable labels: Source, Channel, "Dest senses environment and adapts"]

Figure 3: Illustration of the autonomous ISP communication system concept.

In  cooperative  systems,  the  destination  senses  the  environment  with  cooperation  from  the  source 
and  makes  choices  that  are  communicated  to  the  source  or  that  must  be  negotiated  with  the  source. 
This  requires  either  a  logical  or  physical  subchannel  for  passing  side  information  back  and  forth, 
as  illustrated  in  Figure  4. 


[Block diagram; recoverable labels: Source, Channel, Destination, "Dest senses environment", "Dest and source jointly adapt"]

Figure 4: Illustration of the cooperative ISP communication system concept.

For  all  the  possible  ISP-enabled  communication  systems,  we  propose  the  following  concept  of 
operation: 

Let the set of adjustable parameters at the destination be denoted by $P_a$, and let the destination perform a set of measurements every $T$ seconds. Based on the measurements, adjust one or more parameters in $P_a$ to improve error performance and/or increase data rate.

A  major  goal  of  this  work  is  the  development  of  a  suitable  mathematical  framework  for  predicting 
performance  improvements  due  to  the  use  of  the  proposed  ISP  concept  of  operations. 


3.1  Conceptual  ISP  Examples 

In  this  section  we  provide  a  few  examples  of  the  ISP  system  concepts  to  fix  the  autonomous  and 
cooperative  notions. 

Simple  Autonomous  Parameter  Adjustment. 

In this first example, the destination senses the environment and makes decisions regarding the best parameter adjustment but does not communicate the adjustment to the source. The destination senses the channel by adjusting its position in space and measuring the quality of the communication for the new position. Since each new position induces a new channel, one of the new positions will correspond to the best new performance level. The sensor then moves to this position.

To be more specific, let the sensor position at time $t$ be denoted by $p(t)$ and the number of candidate positions be restricted to $P_n$. Let the $P_n$ vectors $r_j$, $j = 1, 2, \dots, P_n$, represent points on a sphere centered at the origin and with radius $r$ meters. Then the sensor measures communication quality at each position $p(t) + r_j$ and selects the position with the highest quality for $p(t + T)$.
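The position-update rule can be sketched in a few lines. In the illustration below, the candidate offsets lie on a circle (a planar slice of the sphere, chosen for brevity), and `toy_quality` is a purely hypothetical stand-in for a measured quality such as an SNR or constellation-tightness estimate.

```python
import math

def candidate_offsets(radius, count):
    """`count` points on a circle of the given radius (a 2-D slice of the sphere)."""
    return [
        (radius * math.cos(2 * math.pi * j / count),
         radius * math.sin(2 * math.pi * j / count),
         0.0)
        for j in range(count)
    ]

def next_position(position, offsets, quality):
    """Measure quality at each candidate p(t) + r_j and return the best one."""
    candidates = [tuple(p + r for p, r in zip(position, off)) for off in offsets]
    return max(candidates, key=quality)

def toy_quality(pos):
    """Hypothetical quality map: best reception near the point (1, 0, 0)."""
    return -math.dist(pos, (1.0, 0.0, 0.0))

position = (0.0, 0.0, 0.0)
offsets = candidate_offsets(radius=0.5, count=8)
for _ in range(2):                       # two adaptation intervals of length T
    position = next_position(position, offsets, toy_quality)
```

Note that the rule as stated always moves to one of the candidate positions; a practical variant might also keep the current position as a candidate, which is a design choice outside the text.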

Note that the radius $r$ need not be particularly large for this scheme to work well. For instance, if the channel experiences deep, long fades, small changes in position can move the sensor completely out of the fade, increasing SNR by tens of dB. Note also that the ‘autonomous’ label for this kind of ISP does not preclude the use of repeated known training sequences sent by the source. Channel or BER estimates based on such training sequences represent two quality measures. A measure that is independent of pilot symbols or training sequences is the constellation-quality measure that assesses the tightness of the constellation cluster elements at the destination.

Random  Cooperative  Parameter  Adjustment. 

In  this  simplest  CISPS,  the  destination  assesses  the  communication  quality  through  some  means 
(use  of  periodically  repeated  pilot  symbols  or  training  sequences  or  constellation  quality  measures), 
and  when  the  quality  is  judged  poor,  the  destination  chooses  a  set  of  system  parameters  at  random 
and  relays  this  information  to  the  source  through  the  feedback  channel.  A  limiting  case  corresponds 
to  continuous  random  parameter  adjustment  in  which  the  quality  is  always  judged  to  be  poor. 

The  amount  of  information  to  be  sent  to  the  source  through  the  feedback  channel  can  actually 
be  quite  small  if  the  source  and  destination  each  have  access  to  a  table  of  system-parameter  options. 
The  random  destination  decision  could  then  be  communicated  to  the  source  by  the  transmission  of 
a  single  small-integer  table  index.  To  be  specific,  the  CISPS  could  use  one  of  four  parameter  sets 
shown  in  Table  1.  Note  that  these  parameter  sets  include  differences  in  modulation  type,  coding 
scheme,  center  frequency,  and  symbol  rate  (bandwidth).  The  2FSK  parameter  set  is  more  likely 
to  result  in  acceptable  communication  for  the  poorer  channels,  whereas  the  64QAM  parameter  set 
will  maximize  data  rate  for  the  better  channels. 
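
The table-index feedback described above can be sketched in a few lines of Python. This is an illustrative sketch only: the parameter sets are transcribed from Table 1, while the function names, the scalar quality measure, and the threshold rule are assumptions introduced here, not part of the reported system.

```python
import random

# Parameter sets transcribed from Table 1 (table index -> settings).
PARAMETER_SETS = {
    1: {"modulation": "2FSK", "coding": "Conv., R = 1/2", "carrier_mhz": 200, "symbol_rate_khz": 20},
    2: {"modulation": "OOK", "coding": "None", "carrier_mhz": 250, "symbol_rate_khz": 30},
    3: {"modulation": "QPSK", "coding": "Conv., R = 4/5", "carrier_mhz": 300, "symbol_rate_khz": 50},
    4: {"modulation": "64QAM", "coding": "Conv., R = 4/5", "carrier_mhz": 300, "symbol_rate_khz": 50},
}


def choose_feedback_index(quality, threshold, current_index, rng):
    """Destination side (hypothetical rule): keep the current set when the
    measured quality is acceptable, otherwise draw a table index uniformly
    at random."""
    if quality >= threshold:
        return current_index
    return rng.choice(sorted(PARAMETER_SETS))


def feedback_bits(num_sets):
    """Number of bits needed to send one table index to the source."""
    return max(1, (num_sets - 1).bit_length())
```

With four table entries, each feedback message is only two bits long, which is why the feedback-channel load can be so small.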

Preset  Cooperative  Parameter  Adjustment. 

In the next CISPS, the destination again measures the communication quality and makes a decision regarding whether to switch parameter sets. Here, however, the destination can choose only two new parameter sets for each current parameter set. One of the sets corresponds to degraded quality and the other to improved quality relative to the previous quality assessment. Thus, the parameter-set index is a Markov process with transition probabilities jointly specified by the stochastic channel model and application design constraints, as illustrated in Figure 5 for the parameter sets in Table 1. The basic idea is that the system senses the environment and tries to match its parameters to the environment, but it has only a limited number of choices: it cannot select the parameters arbitrarily and must instead walk up (or down) the sequence of progressively more (or less) capable system parameter sets.
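
A minimal sketch of this constrained walk is given below. It assumes, purely for illustration, that the parameter sets are ordered from most robust (index 1) to highest rate (index 4), that quality is a scalar, and that two thresholds (`low`, `high`) decide the direction of the step; none of these details are specified in the report.

```python
def next_index(current, quality, low, high, num_sets=4):
    """Move one step down the table when quality is poor, one step up when
    quality is good, and stay put otherwise. Indices are clamped to the
    range 1..num_sets, so the end states can only stay or move inward."""
    if quality < low:
        return max(1, current - 1)
    if quality > high:
        return min(num_sets, current + 1)
    return current
```

Because each state connects only to itself and its neighbors, a single bad quality estimate can move the link by at most one parameter set per interval.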

Optimal  Cooperative  Parameter  Adjustment. 

In the third CISPS, the destination attempts to select the best parameter set from its allowable sets. Thus, all transitions between parameter-set states are potentially valid. The probability $p(j \mid k)$ of transitioning from state k to state j is simply the probability of state j, $p(j)$. This scheme may be preferable when the channel is sensed with very little error, but when the sensing is error-prone or crude, the previous CISPS may be superior because the system is prohibited from making unwise radical changes in the parameter set.

Set No.   Modulation Type   Coding Scheme    Carrier Frequency   Symbol Rate
1         2FSK              Conv., R = 1/2   200 MHz             20 kHz
2         OOK               None             250 MHz             30 kHz
3         QPSK              Conv., R = 4/5   300 MHz             50 kHz
4         64QAM             Conv., R = 4/5   300 MHz             50 kHz

Table 1: Four example parameter sets for the random cooperative parameter-adjustment example.

3.2  Summary  of  Basic  ISP  Approach 

The basic ISP notion advanced in this report is the periodic sensing of a time-varying environment combined with a means for modifying communication-link processing and transmission parameters accordingly. The concept generalizes modern communication-system notions such as rate adaptation and multi-carrier modulation by allowing virtually any parameter to be modified, such as carrier frequency, modulation type (not limited to constellation type), coding scheme and parameters, etc. For example, a system may adaptively switch between frequency hopping, direct-sequence spread spectrum, and large-alphabet digital QAM in response to the presence of narrowband interferers, shadowing or fading, high SNR, or time-varying throughput and error constraints.


4  Mathematical  Abstraction 

In this section we set out to abstract our ISP communication problem in general mathematical terms. This abstraction will form the basis for posing interesting and relevant mathematical problems.

First, note that each system parameter set effectively induces a discrete channel, as illustrated in Figure 6. Often we will be interested in the subclass of discrete memoryless channels (DMCs), for which each output is statistically dependent only on the corresponding input and not on previous inputs or outputs. A general DMC is characterized by the input alphabet $x_1, x_2, \ldots, x_K$, the output alphabet $y_1, y_2, \ldots, y_J$, the channel transition probabilities $p_{j|k}$,

$$p_{j|k} = \mathrm{Prob}(Y = y_j \mid X = x_k),$$

and the prior probabilities $q_1, q_2, \ldots, q_K$, as illustrated by the graph in Figure 7. The probability $p_{j|k}$ is the probability that the channel output is decided as $y_j$ given that the channel input was $x_k$. The probability $q_k$ is the prior probability of transmitting the letter $x_k$.

The transition probabilities are determined by the physical channel, the modulation, and the demodulator. For example, consider uncoded BPSK signaling on the additive white Gaussian noise (AWGN) channel with perfectly coherent demodulation and equiprobable inputs. Then $J = K = 2$ and the probability of bit error is given by the well-known formula

$$P_e = Q\left(\sqrt{\frac{2E_b}{N_0}}\right),$$

where $E_b$ is the energy per bit, $N_0$ is the noise spectral density, and

$$Q(x) = \int_x^\infty \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du.$$

Denoting $x_1$ as 0, $x_2$ as 1, $y_1$ as 0, and $y_2$ as 1, we obtain the binary DMC shown in Figure 8.
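
As a numerical illustration of this induced binary DMC, the following Python sketch evaluates the crossover probability from $E_b/N_0$ using the identity $Q(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$. The function names and the decibel input convention are choices made here for illustration.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bpsk_dmc(ebn0_db):
    """Return the 2x2 transition matrix p[k][j] = P(Y = j | X = k) for
    coherent uncoded BPSK on AWGN at the given Eb/N0 in dB."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)          # convert dB to a linear ratio
    pe = q_function(math.sqrt(2.0 * ebn0))   # bit-error (crossover) probability
    return [[1.0 - pe, pe], [pe, 1.0 - pe]]
```

For example, `bpsk_dmc(0.0)` yields a crossover probability of about 0.0786, so the induced channel is a binary symmetric channel with that error rate.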




Figure  6:  The  basic  communication  system  model  with  induced  DMC  highlighted. 

Time-Varying  Systems. 

For  our  time-varying  communication-system  setup,  the  induced  DMC  is  a  function  of  time.  That 
is,  the  alphabets,  alphabet  sizes,  and  transition  probabilities  are  functions  of  time.  To  model  the 
interval-oriented  aspect  of  our  particular  situation,  let  t  index  each  interval  of  length  T  seconds. 
Then the DMC is considered constant on each interval t. For the tth interval, the input alphabet size is defined to be $K' = K(t)$ and the output alphabet size is $J' = J(t)$. This time-variant DMC can be diagrammed as shown in Figure 9.

Figure 7: A diagram of a generic discrete memoryless channel.

Figure 8: DMC for BPSK signaling (each input has prior probability 1/2).

So  far  our  time-varying  system  setup  is  rather  general — a  sequence  of  DMCs — and  has  not 
been  explicitly  connected  to  a  model  for  the  channel  evolution  or  to  a  model  for  system-parameter 
selection.  First,  let’s  model  the  channel.  Then  we  will  connect  the  channel  to  the  DMC  and  finally 
connect  the  DMC  choice  for  interval  t  +  1  to  the  DMC  and  channel  estimate  for  interval  t. 

Channel  Models. 

For each time interval indexed by t, we will assume that the channel remains in a particular state. The channel-state random variable is denoted by $S$ and it takes on the $A$ values $s_1, s_2, \ldots, s_A$. The channel-state random variable for the tth interval is $S_t$ and we would like to characterize the set $\{S_t\}$ of random variables for all t. We assume that the channel cannot make radical changes over reasonable times (that is, T), and that the channel evolves with primary influence coming from its recent history. Thus, we can model the channel evolution using a first-order Markov process (after Gallager [2]),

$$P(S_t \mid S_{t-1}, S_{t-2}, \ldots, S_{t-K}) = P(S_t \mid S_{t-1}), \quad \forall K > 1 \text{ and } \forall t.$$

Figure 9: General time-variant DMC for ISP modeling.


This model does not preclude independent channel states for which

$$P(S_t \mid S_{t-1}) = P(S_t) \quad (P(S_t = s_j),\ j = 1, \ldots, A).$$

We  now  present  a  few  simple  examples  of  our  channel  model. 

1. Static AWGN. Let $A = 1$.

2. Fast-Fading Channels. Let $A = 2$, $s_1$ denote the unfaded state, and $s_2$ denote the faded state. A typical state-transition diagram is shown below.

[State-transition diagram; recoverable label: p(2|2) = 0.1]

3. Slow-Fading Channels. Let $A = 2$, $s_1$ denote the unfaded state, and $s_2$ denote the faded state. A typical state-transition diagram is shown below.

[State-transition diagram; recoverable label: p(2|2) = 0.3]

4. Shadowed Channels. Let $A = 2$, $s_1$ denote the normal state, and $s_2$ denote the shadowed state. A typical state-transition diagram is shown below.

[State-transition diagram; recoverable labels: p(1|1) = 0.9999, p(2|2) = 0.99 (or 1.0), p(1|2) = 0.01 (or 0.0); state 2 is labeled "Shadowed"]

The channel state affects a fixed DMC only through the transition probabilities. So, each transition probability becomes an explicit function of the channel state rather than the more general function of t,

$$p_{j|k}(t) = p_{j|k}(S_t),$$

which renders the transition probabilities (and hence the DMC) random variables rather than arbitrary functions of time, as shown in Figure 10.

Figure 10: Time-variant channel-dependent DMC for ISP modeling.


DMC  Selection  based  on  Channel-State  Sensing. 

Let  us  now  impose  additional  structure  on  the  time-variant  input  and  output  alphabets  and  prior 
probabilities  that  define  a  DMC.  The  central  notion  is  that  the  system  should  sense  the  channel 
state  and  use  that  information  to  select  the  next  parameter  set. 

Suppose  that  the  system  measures  the  current  channel  state  with  some  error  and  arrives  at  the 
state  estimate  random  variable  Zt.  For  each  time  interval  indexed  by  t,  we  define  the  system  as 
the  input  and  output  alphabets  and  the  prior  distribution  on  the  input  alphabet.  We  assume  that  the 
destination  records  all  previous  channel-state  estimates  and  all  previously  selected  system  choices. 
Let there be $B$ possible systems denoted by $d_1, d_2, \ldots, d_B$ and let the random variable $D_t$ denote the system at interval t. The system is modeled as some function of the current channel-state estimate and the current and previous system choices,

$$D_{t+1} = f(Z_t, D_t, D_{t-1}, \ldots).$$

Let us further suppose that only the current channel-state estimate and the current system are used to choose the next system,

$$D_{t+1} = f(Z_t, D_t).$$

We can model this probabilistically by another Markov process,

$$P(D_{t+1} \mid Z_t, D_t, D_{t-1}, \ldots) = P(D_{t+1} \mid Z_t, D_t).$$

Now, if the destination chooses a system at random without regard to either $Z_t$ or $D_t$, then

$$P(D_{t+1} \mid Z_t, D_t) = P(D_{t+1}) \quad (P(D_{t+1} = d_j),\ j = 1, \ldots, B).$$

Our DMC diagram now requires three graphs: one for the channel process, one for the system process, and one for the communication process, as shown in Figure 11. To complete our notation, let the channel-state transition probabilities be denoted by $r_{j|a}$,

$$r_{j|a} = P(S_t = s_j \mid S_{t-1} = s_a),$$

and the system-state transition probabilities be denoted by $v_{j|a,b}$,

$$v_{j|a,b} = P(D_t = d_j \mid S_{t-1} = s_a, D_{t-1} = d_b).$$

Special  Cases  of  the  General  ISP  Communication- System  Model. 

Here  we  point  out  some  special  cases  of  our  ISP  model  having  practical  or  theoretical  interest. 

1.  Classic  DMC  on  a  Static  Channel. 

Here  A  =  B  =  1,  meaning  that  the  communication  system  is  fixed  and  the  channel  has  a 
single  state.  The  alphabets,  priors,  and  transition  probabilities  are  time-invariant.  This  DMC 
applies  to  many  of  the  conventional  results  on  communications  systems. 

2.  Classic  DMC  on  a  Time-Variant  Channel. 

Here $A > 1$ and $B = 1$, meaning that the alphabets and priors are fixed, but the channel transition probabilities are time-varying [2] due to the evolving channel state.

3. Variable DMC on a Static Channel.

Here $A = 1$ and $B > 1$, meaning that the channel has a single state and the DMC is time-varying. (This case may have little practical value.)

4.  Uncorrelated  DMC  and  Channel. 

Here the sequence of DMC systems $\{D_t\}$ is uncorrelated with the sequence of channel-state estimates $\{Z_t\}$. For example, the system does not use measurements to select the DMC, but
merely  selects  it  at  random.  Alternatively,  the  DMC  selection  is  influenced  by  time-varying 
source  data-rate  constraints. 

5.  Unconstrained  DMC  Selection. 

Here the system Markov process characterized by $P(D_t \mid Z_{t-1}, D_{t-1})$ degenerates to $P(D_t \mid Z_{t-1})$. That is, the choice of the system parameters is not influenced by the current system, only by the current channel-state estimate.

6. Constrained DMC Selection.

Here the system and channel are first-order Markov processes. The choice of system parameters is explicitly influenced by the previous system and the current channel state.

Figure 11: Final diagram for time-variant DMC for ISP modeling. (Panels: Communication Process Diagram; System State $D_t$ Transition Diagram; Channel State $S_t$ Transition Diagram.)


Connection  to  AISPS  and  CISPS  Systems  in  Section  3. 

For  the  AISPS  example  in  Section  3.1  in  which  the  destination  samples  performance  at  each  of  a 
set  of  perturbed  positions  and  selects  the  best  as  the  new  destination  position,  we  have  an  instance 
of  special  case  5  above.  The  channel-state  estimate  in  this  situation  is  the  collection  of  quality 
measures  indexed  by  the  position  vector. 

For  the  random  cooperative  parameter  adjustment  CISPS  example,  we  have  an  instance  of 
special  case  4,  and  for  the  preset  cooperative  parameter  adjustment  example,  we  have  an  instance 
of  special  case  6.  Finally,  for  the  optimal  cooperative  parameter  adjustment  example,  we  have 
another  instance  of  special  case  5. 

5  Specific  Mathematical  Problems  of  Interest

Now  that  we  have  developed  a  general  framework  for  specifying  mathematical  models  for  a  wide 
class  of  ISP-enabled  communication  systems,  we  would  like  to  analyze  the  models  to  determine 
their  structure  and  their  performance. 

The  central  question  before  us  is:  What  is  the  benefit  of  an  ISP  communication  system  relative 
to  a  conventional  system?  We  desire  to  pose  and  solve  problems  that  help  answer  this  question. 

Communication  systems  are  judged  by  five  criteria:  optimality,  error  performance,  and  power, 
bandwidth,  and  computational  efficiency.  For  the  first  of  these,  the  best  possible  system  would 
transmit  at  a  rate  very  near  the  Shannon  capacity  with  excellent  error  performance.  What  we 
would  like  to  do  here  is  find  a  class  of  ISP-enabled  systems  that  have  a  much  larger  capacity  than 
their  traditional  (conventional)  counterparts,  then  find  engineering  solutions  that  approximate  these 
systems  with  reasonable  power,  bandwidth,  and  computational  complexity  at  a  high  performance 
level.  On  the  other  hand,  large  capacity  gains  may  not  be  found  using  the  present  formulation 
without  mandatory  order-of-magnitude  increases  in  bandwidth,  power,  or  complexity. 

So the first step in our mathematical analysis of the proposed ISP communication system models is finding their capacities [2]-[16] since this will guide us to the situations with highest potential ISP payoff. We include in the problem statements below several baseline problems that also serve as warm-up exercises.

Problem  Statements. 

1.  For  an  AWGN  channel,  derive  equivalent  DMCs  for  MFSK,  MPSK,  and  MQAM.  Include 
also  DSSS  BPSK  and  FH  signals  if  time  permits. 

2.  Given  a  general  time-invariant  DMC,  derive  a  formula  for  capacity.  Evaluate  the  formula 
for  a  variety  of  parameters  such  as  K,  J,  and  the  transition  probabilities.  Make  sure  results 
default  to  known  or  published  results. 

3. Given a static system D and a variable channel with A > 1 states and a Markov probability structure, derive a formula for capacity. Evaluate numerically and compare to the static-channel baseline. This problem deals with a fixed communication system facing a time-varying channel.

4. Given a static channel S and a variable DMC with B > 1 possible system states and a memoryless system state sequence (systems are chosen independent of the channel state estimates with some probability distribution over the B systems), derive capacity and evaluate numerically. This problem deals with a time-varying communication system in the presence of a static channel.

5. For a model employing a variable channel (A > 1) and a variable DMC (B > 1), assume
that  the  best  system  is  always  chosen  for  the  current  estimated  channel  state.  Consider  two 
cases:  (1)  the  channel  state  is  estimated  perfectly  and  (2)  the  channel  state  is  estimated  with 
a  known  error  rate.  Derive  and  evaluate  capacity. 

6.  Given  A  >  1,  B  >  1,  and  Dt  and  St  first-order  Markov  with  each  system  state  dj  connected 
to  at  most  itself  and  two  additional  states,  derive  and  evaluate  capacity. 


6  Engineering  Issues 

We expect that if any large capacity or performance gains are predicted by the mathematical analysis of our ISP model, they will require relatively sophisticated implementations. In particular, the ability to radically and quickly alter modulation type will be greatly enhanced by the use of the emerging software-radio and high-speed computational technologies. Similarly, the ability to radically and quickly modify the carrier frequency across a wide band will be enhanced by cutting-edge switching and oscillator technology.

Significant engineering issues are expected to center on how best to design any required periodically transmitted pilot symbols and on the specific elements that enter the modifiable-parameter
set  Pa. 

7  Problem  Analysis 

In  this  section  we  present  our  analysis  results  for  the  problems  posed  in  Section  5. 

7.1  Formal  Definitions 

Definition 1 (Input Alphabet) Let the input alphabet for a static (time-invariant) discrete memoryless channel be defined by the K numbers $x_1, \ldots, x_K$. For a dynamic (time-varying) DMC, the size of the alphabet is variable. For the ith DMC $D_i$, the alphabet size is $K(D_i) = K'$.

Definition 2 (Output Alphabet) Let the output alphabet of a static DMC be defined by the J numbers $y_1, \ldots, y_J$. For the ith DMC $D_i$ in a sequence of DMCs, the size is a function of $D_i$,

$$J' = J(D_i).$$

Definition 3 (Prior Input Probabilities) Let the prior probabilities on the input alphabet for DMC $D_i$ be denoted by $q_1(i, D_i), \ldots, q_{K(D_i)}(i, D_i)$. For each i we require

$$\sum_{k=1}^{K(D_i)} q_k(i, D_i) = 1.$$

For a static DMC, the prior probabilities are denoted simply by $q_k$, $k = 1, \ldots, K$.

Definition 4 (Transition Probabilities) The transition probabilities for a static DMC with K input letters and J output letters are defined by

$$P(Y = y_j \mid X = x_k) = p(j \mid k) = p_{j|k},$$

for $k = 1, \ldots, K$ and $j = 1, \ldots, J$. For the ith DMC $D_i$ in a sequence of DMCs and a dynamic channel with channel state $S_i$, we have

$$P(Y_i = y_j \mid X_i = x_k, S_i, D_i) = p(j \mid k, S_i, D_i) = p_{j|k}(S_i, D_i).$$

When the channel is static the conditional transition probabilities reduce to $p_{j|k}(D_i)$ and when the DMC is fixed they reduce to $p_{j|k}(S_i)$.

Definition 5 (Dynamic Channel Model) The dynamic channel is modeled as a first-order Markov process with A states $s_1, \ldots, s_A$. The channel state for the ith time interval is modeled by the random variable $S_i$. The channel-state transition probabilities obey the relation

$$P(S_{i+1} = s_{j_1} \mid S_i = s_{j_2}, S_{i-1} = s_{j_3}, \ldots) = P(S_{i+1} \mid S_i),$$

for all $i, j_1, j_2, \ldots$.

Definition  6  (System  Parameter  Set)  A  system  is  defined  by  the  selection  of  an  input  alphabet, 
output  alphabet,  and  prior  probabilities  on  the  input  alphabet.  A  system  is  typically  denoted  by  D. 

Definition 7 (Channel Estimator Model) The channel estimate for time interval i is denoted by $Z_i$; this is an estimate of $S_i$ and can take on the values $s_1, \ldots, s_A$. The estimator is a function of the current channel state only and is modeled probabilistically as

$$P(Z_i = s_{j_1} \mid S_i = s_{j_2}) = P(Z_i \mid S_i).$$

For perfect channel-state estimation, $P(Z_i = s_{j_1} \mid S_i = s_{j_2})$ is 1 for $j_1 = j_2$ and is zero otherwise.

Definition 8 (Dynamic System Model) The system evolves as a function of previous systems and the previous channel-state estimates. The system in the ith time interval is denoted by $D_i$ and can take on one of the B values $d_1, \ldots, d_B$. The system sequence is modeled as a Markov process,

$$P(D_{i+1} = d_l \mid S_i = s_{j_1}, S_{i-1} = s_{j_2}, \ldots, D_i = d_{k_1}, D_{i-1} = d_{k_2}, \ldots) = P(D_{i+1} = d_l \mid D_i = d_{k_1}, S_i = s_{j_1}) = P(D_{i+1} \mid S_i, D_i).$$

Definition 9 (Channel-Use Sequence) The random variables $x$ and $y$ denote channel inputs and outputs. Thus, $x$ takes on the values $x_1, \ldots, x_K$ and $y$ takes on the values $y_1, \ldots, y_J$. Consider a sequence of N successive channel uses. Let the N-vectors $\mathbf{x} = [x_1, \ldots, x_N]$ and $\mathbf{y} = [y_1, \ldots, y_N]$ denote the random inputs and outputs, respectively. The ith element of $\mathbf{x}$, $x_i$, takes one of the values $x_1, \ldots, x_{K(D_i)}$ and similarly $y_i$ takes one of the values $y_1, \ldots, y_{J(D_i)}$. Let $\mathbf{u} = [u_1, \ldots, u_N]$ and $\mathbf{w} = [w_1, \ldots, w_N]$ denote valid values for $\mathbf{x}$ and $\mathbf{y}$, respectively. The sequence of channel states is $\mathbf{S} = [S_1, \ldots, S_N]$, where $S_i$ can take on the values $s_1, \ldots, s_A$. Similarly, the sequence of systems is $\mathbf{D} = [D_1, \ldots, D_N]$ where $D_i \in \{d_1, \ldots, d_B\}$. Let $\boldsymbol{\sigma} = [\sigma_1, \ldots, \sigma_N]$ and $\boldsymbol{\delta} = [\delta_1, \ldots, \delta_N]$ denote valid values for $\mathbf{S}$ and $\mathbf{D}$, respectively.

Definition 10 (Symmetric Channel) A DMC is symmetric if the set of outputs can be partitioned into subsets in such a way that for each subset the matrix of transition probabilities has the property that each row is a permutation of each other row and each column is a permutation of each other column [ ].

7.2  Problem  1:  Equivalent  DMCs  for  Various  Modulation  Types 

This  aspect  of  the  work  was  deferred  in  favor  of  the  other  work  reported  on  herein,  and  due  to 
changes  in  contract  funding  and  objectives,  we  were  unable  to  return  to  it. 

7.3  Problem  2:  Static  DMC  and  Channel 

Theorem 1 (Mutual Information for Static DMC and Static Channel)

The average mutual information for a static system on a static channel is given by

$$I(X; Y) = \sum_{k=1}^{K} \sum_{j=1}^{J} q_k\, p_{j|k} \log \frac{p_{j|k}}{\sum_{i=1}^{K} q_i\, p_{j|i}}.$$

Theorem 2 (Capacity for a Static DMC and Static Channel)

The capacity of a static system on a symmetric static channel is achieved by using equiprobable input letters, and is given by

$$C = \max_{\{q_k\}} I(X; Y) = I(X; Y)\Big|_{q_k = 1/K}.$$
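
The static-case quantities can be evaluated directly. The sketch below computes the standard average mutual information of a DMC, in bits, and then the capacity of a symmetric channel by plugging in equiprobable inputs. The function names and the row-per-input matrix layout `p[k][j]` are conventions assumed here, not taken from the MATLAB code mentioned in Section 8.

```python
import math


def mutual_information(q, p):
    """I(X;Y) in bits for priors q[k] and transitions p[k][j] = P(y_j | x_k)."""
    num_out = len(p[0])
    # Output distribution P(y_j) = sum_k q_k p_{j|k}.
    py = [sum(q[k] * p[k][j] for k in range(len(q))) for j in range(num_out)]
    total = 0.0
    for k, qk in enumerate(q):
        for j in range(num_out):
            if qk > 0.0 and p[k][j] > 0.0:   # 0 log 0 is taken as 0
                total += qk * p[k][j] * math.log2(p[k][j] / py[j])
    return total


def symmetric_capacity(p):
    """Capacity of a symmetric DMC: mutual information at uniform priors."""
    k = len(p)
    return mutual_information([1.0 / k] * k, p)
```

For a binary symmetric channel with crossover probability 0.1 this gives about 0.531 bits per channel use, matching the textbook value $1 - H(0.1)$.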


7.4  Problem  3:  Static  DMC  and  Dynamic  Channel 

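Theorem 3 (Mutual Information for Static DMC and Dynamic Channel)

The average mutual information for a static system facing a dynamic channel over N channel uses is given by

$$I(X^N; Y^N) = \sum_{\mathbf{u}, \mathbf{w}} P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u})\, P(\mathbf{x} = \mathbf{u}) \log \frac{P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u})}{P(\mathbf{y} = \mathbf{w})},$$

where

$$P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u}) = \sum_{\boldsymbol{\sigma}} \left[ P(S_1 = \sigma_1) \prod_{j=1}^{N-1} P(S_{j+1} = \sigma_{j+1} \mid S_j = \sigma_j) \right] \times \left[ \prod_{k=1}^{N} P(y_k = w_k \mid x_k = u_k, S_k = \sigma_k) \right],$$

$$P(\mathbf{y} = \mathbf{w}) = \sum_{\boldsymbol{\sigma}} \left[ P(S_1 = \sigma_1) \prod_{j=1}^{N-1} P(S_{j+1} = \sigma_{j+1} \mid S_j = \sigma_j) \right] \left[ \prod_{k=1}^{N} P(y_k = w_k \mid S_k = \sigma_k) \right],$$

and

$$P(y_k = w_k \mid S_k = \sigma_k) = \sum_{l=1}^{K} P(y_k = w_k \mid x_k = x_l, S_k = \sigma_k)\, q_l,$$

$$P(\mathbf{x} = \mathbf{u}) = \prod_{i=1}^{N} q_{l_i}, \quad u_i = x_{l_i}.$$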

Theorem 4 (Capacity for a Static DMC and Dynamic Channel)

The capacity of a static system on a symmetric dynamic channel is achieved by using equiprobable inputs for each system, and is given by

$$C = \lim_{N \to \infty} \max_{\{q_k\}} \frac{1}{N} I(X^N; Y^N) = \lim_{N \to \infty} \frac{1}{N} I(X^N; Y^N)\Big|_{q_k = 1/K}.$$
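
For small N, the block mutual information of a static system on a Markov channel can be evaluated by exhaustive enumeration. The Python sketch below is one illustrative way to do so and is not the MATLAB implementation referred to in Section 8; all names and the array layouts are assumptions. It forms $P(\mathbf{y} = \mathbf{w})$ by summing $P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u}) P(\mathbf{x} = \mathbf{u})$ over all input sequences, which gives the same value as the product form when the inputs are independent.

```python
import itertools
import math


def block_mutual_information(q, p_state, r, pi1, n):
    """Brute-force I(X^N; Y^N) in bits for a static system.

    q[k]             prior of input letter k (independent across uses)
    p_state[a][k][j] P(y = j | x = k, S = a)
    r[a][b]          P(S_{t+1} = b | S_t = a)
    pi1[a]           P(S_1 = a)
    """
    num_in, num_out, num_states = len(q), len(p_state[0][0]), len(pi1)
    # Probability of every channel-state sequence sigma.
    seq_prob = {}
    for sigma in itertools.product(range(num_states), repeat=n):
        prob = pi1[sigma[0]]
        for a, b in zip(sigma, sigma[1:]):
            prob *= r[a][b]
        seq_prob[sigma] = prob

    def p_y_given_x(w, u):
        # Average the per-use transition products over all state sequences.
        return sum(
            prob * math.prod(p_state[s][x][y] for s, x, y in zip(sigma, u, w))
            for sigma, prob in seq_prob.items()
        )

    inputs = list(itertools.product(range(num_in), repeat=n))
    outputs = list(itertools.product(range(num_out), repeat=n))
    p_x = {u: math.prod(q[k] for k in u) for u in inputs}
    cond = {(w, u): p_y_given_x(w, u) for w in outputs for u in inputs}
    p_y = {w: sum(cond[(w, u)] * p_x[u] for u in inputs) for w in outputs}
    total = 0.0
    for (w, u), pyx in cond.items():
        if pyx > 0.0 and p_x[u] > 0.0:
            total += pyx * p_x[u] * math.log2(pyx / p_y[w])
    return total
```

Dividing the result by N with uniform priors gives the finite-N quantity inside the limit of Theorem 4. With a single channel state the value reduces to N times the static capacity, which is a convenient sanity check.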


7.5  Problem  4:  Dynamic  DMC  and  Static  Channel 

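Theorem 5 (Mutual Information for Dynamic DMC and Static Channel)

The average mutual information for a dynamic system facing a static channel over N channel uses is given by

$$I(X^N; Y^N) = \sum_{\mathbf{u}, \mathbf{w}} P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u})\, P(\mathbf{x} = \mathbf{u}) \log \frac{P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u})}{P(\mathbf{y} = \mathbf{w})},$$

where

$$P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u}) = \sum_{\boldsymbol{\delta}} P(D_1 = \delta_1) \prod_{j=1}^{N-1} P(D_{j+1} = \delta_{j+1} \mid D_j = \delta_j) \times \left[ \prod_{k=1}^{N} P(y_k = w_k \mid x_k = u_k, D_k = \delta_k) \right],$$

$$P(\mathbf{y} = \mathbf{w}) = \sum_{\boldsymbol{\delta}} P(D_1 = \delta_1) \prod_{j=1}^{N-1} P(D_{j+1} = \delta_{j+1} \mid D_j = \delta_j) \times \left( \prod_{i=1}^{N} \sum_{l=1}^{K(\delta_i)} q_l(i, \delta_i)\, P(y_i = w_i \mid x_i = x_l, D_i = \delta_i) \right),$$

and

$$P(\mathbf{x} = \mathbf{u}) = \sum_{\boldsymbol{\delta}} P(\mathbf{D} = \boldsymbol{\delta}) \prod_{j=1}^{N} q_{l_j}(j, \delta_j), \quad u_j = x_{l_j}.$$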

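Theorem 6 (Mutual Information for Random DMC and Static Channel)

The average mutual information for a randomly selected DMC on a static channel over N channel uses is given by

$$I(X^N; Y^N) = \sum_{\mathbf{u}, \mathbf{w}} P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u})\, P(\mathbf{x} = \mathbf{u}) \log \frac{P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u})}{P(\mathbf{y} = \mathbf{w})},$$

where

$$P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u}) = \sum_{\boldsymbol{\delta}} \left[ \prod_{j=1}^{N} P(D_j = \delta_j) \right] \left[ \prod_{k=1}^{N} P(y_k = w_k \mid x_k = u_k, D_k = \delta_k) \right],$$

$$P(\mathbf{y} = \mathbf{w}) = \sum_{\boldsymbol{\delta}} \left( \prod_{j=1}^{N} P(D_j = \delta_j) \right) \left( \prod_{i=1}^{N} \sum_{l=1}^{K(\delta_i)} q_l(i, \delta_i)\, P(y_i = w_i \mid x_i = x_l, D_i = \delta_i) \right),$$

and

$$P(\mathbf{x} = \mathbf{u}) = \sum_{\boldsymbol{\delta}} \left( \prod_{k=1}^{N} P(D_k = \delta_k) \right) \left( \prod_{j=1}^{N} q_{l_j}(j, \delta_j) \right), \quad u_j = x_{l_j}.$$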

Proposition 1 (Capacity for a Dynamic DMC and Static Channel)

The capacity of a dynamic system on a symmetric static channel is achieved by using equiprobable inputs for each system, and is given by

$$C = \lim_{N \to \infty} \max_{\{q_k\}} \frac{1}{N} I(X^N; Y^N) = \lim_{N \to \infty} \frac{1}{N} I(X^N; Y^N)\Big|_{q_l(j, \delta_j) = 1/K(\delta_j)}.$$


7.6  Problems  5  and  6:  Dynamic  DMC  and  Dynamic  Channel 

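Theorem 7 (Mutual Information for Dynamic DMC and Dynamic Channel)

The average mutual information for a dynamic system with channel-state estimation facing a dynamic channel over N channel uses is given by

$$I(X^N; Y^N) = \sum_{\mathbf{u}, \mathbf{w}} P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u})\, P(\mathbf{x} = \mathbf{u}) \log \frac{P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u})}{P(\mathbf{y} = \mathbf{w})},$$

where

$$P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u}) = \sum_{\boldsymbol{\delta}, \boldsymbol{\sigma}} P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u}, \mathbf{D} = \boldsymbol{\delta}, \mathbf{S} = \boldsymbol{\sigma})\, P(\mathbf{D} = \boldsymbol{\delta} \mid \mathbf{S} = \boldsymbol{\sigma})\, P(\mathbf{S} = \boldsymbol{\sigma}),$$

$$P(\mathbf{y} = \mathbf{w} \mid \mathbf{x} = \mathbf{u}, \mathbf{D} = \boldsymbol{\delta}, \mathbf{S} = \boldsymbol{\sigma}) = \prod_{j=1}^{N} P(y_j = w_j \mid x_j = u_j, D_j = \delta_j, S_j = \sigma_j),$$

$$P(\mathbf{D} = \boldsymbol{\delta} \mid \mathbf{S} = \boldsymbol{\sigma}) = P(D_1 = \delta_1) \prod_{j=1}^{N-1} \sum_{l \in \mathcal{A}} P(D_{j+1} = \delta_{j+1} \mid D_j = \delta_j, Z_j = s_l)\, P(Z_j = s_l \mid S_j = \sigma_j),$$

$$P(\mathbf{S} = \boldsymbol{\sigma}) = P(S_1 = \sigma_1) \prod_{k=1}^{N-1} P(S_{k+1} = \sigma_{k+1} \mid S_k = \sigma_k),$$

$$P(\mathbf{y} = \mathbf{w}) = \sum_{\boldsymbol{\sigma}, \boldsymbol{\delta}} P(\mathbf{D} = \boldsymbol{\delta} \mid \mathbf{S} = \boldsymbol{\sigma})\, P(\mathbf{S} = \boldsymbol{\sigma}) \left( \prod_{i=1}^{N} \sum_{l=1}^{K(\delta_i)} q_l(i, \delta_i)\, P(y_i = w_i \mid x_i = x_l, D_i = \delta_i, S_i = \sigma_i) \right),$$

$$P(\mathbf{x} = \mathbf{u}) = \sum_{\boldsymbol{\sigma}, \boldsymbol{\delta}} P(\mathbf{D} = \boldsymbol{\delta} \mid \mathbf{S} = \boldsymbol{\sigma})\, P(\mathbf{S} = \boldsymbol{\sigma}) \prod_{j=1}^{N} q_{l_j}(j, \delta_j), \quad u_j = x_{l_j},$$

and $\mathcal{A} = \{1, \ldots, A\}$.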

Remark 1 Whenever the system is dynamic, there are several possible capacity definitions. Each definition involves a different maximization of average mutual information. In the most general case, we maximize over the entire collection of priors $\{q_l(j, d_k)\}$, a set of size $N \sum_{k=1}^{B} K(d_k)$. In a less general case, the priors for all instances of the system $d_k$ are equal and we maximize over the B sets of priors, which is a smaller set of size $\sum_{k=1}^{B} K(d_k)$.

Proposition 2 (Capacity for a Dynamic DMC and Dynamic Channel)

The capacity of a dynamic system on a symmetric dynamic channel is achieved by using equiprobable inputs for each system, and is given by

$$C = \lim_{N \to \infty} \max_{\{q_k\}} \frac{1}{N} I(X^N; Y^N) = \lim_{N \to \infty} \frac{1}{N} I(X^N; Y^N)\Big|_{q_l(j, \delta_j) = 1/K(\delta_j)}.$$


8  Discussion  and  Numerical  Examples 

The formulas for the most general ISP communication link (a dynamic system and dynamic channel) have been coded in MATLAB. Since all the simpler cases are special cases of this formula, all capacity formulas provided in the preceding theorem statements can be evaluated. For even modest problems involving two systems (B = 2) and two channel states (A = 2), the capacity formulas are costly to evaluate even for small values of N such as 4 or 5. Nevertheless, we present here a few examples to illustrate the potential advantages of the ISP approach over static approaches.
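
The growth in evaluation cost can be made concrete with a rough count. The sketch below assumes a direct evaluation that visits every combination of input sequence, output sequence, channel-state sequence, and system sequence; this counting model and the function name are illustrative assumptions, and the actual MATLAB implementation may organize the sums differently.

```python
def brute_force_terms(num_in, num_out, num_states, num_systems, n):
    """Rough count of (u, w, sigma, delta) combinations visited by a direct
    evaluation of the block mutual information over n channel uses."""
    return (num_in * num_out * num_states * num_systems) ** n
```

For binary alphabets with A = B = 2, this count is 65,536 at N = 4 and 1,048,576 at N = 5, growing by a factor of 16 with each additional channel use.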

Recall  that  the  basic  idea  behind  the  present  effort  is  to  find  simple  ISP  systems  that  have 
relatively  large  capacity  for  difficult  channels.  To  this  end  we  will  be  interested  in  both  the  absolute 
capacity  of  the  ISP  (dynamic)  link  and  the  ratio  of  capacities  between  the  dynamic  and  static  links. 
Before  we  present  initial  results  in  this  vein,  we  first  present  a  set  of  results  aimed  at  providing 
verification  of  the  formulas  and  software. 

8.1  Example  1:  Verification  of  Formulas  and  Software

In  this  first  set  of  examples,  we  provide  evidence  that  the  obtained  formulas  and  their  software 
implementation  provide  correct  results.  Therefore,  we  focus  on  static  systems  and  channels,  for 
which  capacity  results  are  either  known  or  are  obvious. 

We are interested in two distinct kinds of transition probability functions. The first, called flat, corresponds to the case in which $p_{j|k} = P_e$ for $j \neq k$ and $p_{k|k} = 1 - (K - 1)P_e$. That is, all errors have equal probability $P_e$. The second, called exponential, assigns progressively smaller probabilities of error to errors involving symbols with increasing distance. That is, $p_{j|k} = P_e^{F(j,k,K)}$ for $j \neq k$, where

$$F(j, k, K) = \min\{\, |j - k| \bmod K,\ \big| |j - k| - K \big| \bmod K \,\},$$

and

$$p_{k|k} = 1 - \sum_{j \neq k} p_{k|j}.$$

Notice  that  the  flat  and  exponential  error  models  are  identical  for  binary  DMCs. 
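
The two error models can be written out as transition matrices. The Python sketch below is illustrative: it uses zero-based letter indices and a row-per-input layout `p[k][j]`, both of which are conventions chosen here, and it sets each diagonal entry so that the row sums to one.

```python
def ring_distance(j, k, num_letters):
    """F(j, k, K): circular distance between letter indices j and k."""
    d = abs(j - k)
    return min(d % num_letters, abs(d - num_letters) % num_letters)


def flat_matrix(num_letters, pe):
    """p[k][j] = Pe off the diagonal and 1 - (K - 1) Pe on the diagonal."""
    diag = 1.0 - (num_letters - 1) * pe
    return [[diag if j == k else pe for j in range(num_letters)]
            for k in range(num_letters)]


def exponential_matrix(num_letters, pe):
    """p[k][j] = Pe ** F(j, k, K) off the diagonal; the diagonal entry is
    whatever makes the row sum to one."""
    rows = []
    for k in range(num_letters):
        row = [0.0 if j == k else pe ** ring_distance(j, k, num_letters)
               for j in range(num_letters)]
        row[k] = 1.0 - sum(row)
        rows.append(row)
    return rows
```

For K = 2 the only off-diagonal distance is 1, so both constructions return the same matrix, in agreement with the remark above.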

The  first  result  corresponds  to  a  binary  DMC  facing  a  static  channel.  The  software  is  used  to 
compute  the  capacity  of  the  link  as  a  function  of  the  cross-over  (transition)  error  probability  Pe 
(see  Figure  8).  The  resulting  capacity  is  plotted  in  Figure  12,  which  can  be  compared  to  results 
in  many  communication  and  information-theory  textbooks.  The  capacity  has  a  maximum  of  one 
bit  per  channel  use,  which  is  intuitively  obvious,  and  a  minimum  of  zero  bits  when  the  cross-over 
probability  reaches  0.5. 

The second result corresponds to various K-ary DMCs facing static flat and exponential channels. The capacities are plotted in Figure 13. For the flat channel, notice that the capacity for each $K$ equals $\log_2(K)$ bits per channel use when the channel is perfect ($P_e = 0$). Each capacity then reaches zero when $P_e = 1/K$, at which point all probabilities in the transition diagram are equal, which is, again, intuitively pleasing. For the exponential channel, the capacities decrease more slowly with increasing $P_e$, as can be expected.
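The end points quoted for the flat channel can be checked with a short closed-form calculation. The sketch below assumes the standard result that a symmetric DMC achieves capacity with a uniform prior, so that $C = \log_2 K - H(\text{row})$; it is a stand-in for, not a copy of, the numerical maximization performed by the report's software, and the function names are hypothetical.

```python
import math


def xlog2x(p):
    """p * log2(p), with the usual convention 0 * log2(0) = 0."""
    return 0.0 if p <= 0.0 else p * math.log2(p)


def flat_capacity(K, Pe):
    """Capacity in bits per channel use of the K-ary flat channel.

    Assumes a uniform prior is optimal (true for symmetric channels).
    """
    p_correct = 1.0 - (K - 1) * Pe
    return math.log2(K) + xlog2x(p_correct) + (K - 1) * xlog2x(Pe)
```

With $K = 2$ this reduces to the familiar binary symmetric channel expression $1 - H_2(P_e)$.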

8.2  Example  2:  Binary  Systems  Facing  a  Two-State  Channel 

In  this  second  example,  we  consider  a  dynamic  system  consisting  of  two  binary  DMCs  and  a 
dynamic  two-state  channel.  The  example  is  broken  into  two  parts.  The  first  part  deals  with  an 
interferer  that  bounces  back  and  forth  between  the  two  systems’  bands,  and  the  second  part  deals 
with  a  random,  long,  deep  fade  (or  shadow). 
Figure 12: Computed capacity for a binary DMC facing a static channel. (Panel title: Capacity for Flat/Exponential Channel and Binary Alphabet.)

Figure 13: Computed capacities for DMCs with K-ary alphabets in flat and exponential noise channels. (Panel titles: Capacity for Exponential Channel; Capacity for Flat Channel.)
Figure 14: Channel and system state diagrams for the first case in Example Two. (System Model labels recovered from the diagram: p(2|2,1) = 1.0, p(2|2,2) = 0.0, p(1|2,1) = 0.0, p(1|2,2) = 1.0.)

Random  Interferer  or  Jammer. 

The basic notion here is that the two DMCs correspond to binary links operating in disjoint frequency bands. During any one time interval, the interferer is in only one of the bands. This notion leads to the channel and system evolution models diagrammed in Figure 14. The nominal transition probabilities for each system-channel combination are as follows:

System   Channel   $P_e$
1        1         0.5
1        2         $10^{-6}$
2        1         $10^{-6}$
2        2         0.5

That  is,  when  the  channel  state  is  equal  to  the  system  index,  the  interferer  is  in  that  system’s  band 
and  communication  is  not  possible.  Thus,  the  system  evolution  model  in  Figure  14  indicates  that 
when  the  channel-state  estimate  is  1  and  the  current  system  index  is  1,  always  switch  to  system  2. 
Similarly,  switch  from  system  2  to  system  1  when  the  channel-state  estimate  is  2.  Finally,  perfect 
(error-free)  channel-state  estimation  is  assumed  in  this  example. 
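The switching rule just described can be sketched as a small deterministic state machine. The code below assumes, as in the system-evolution model, that the next system is chosen from the current system and the current (perfectly estimated) channel state, so a switch takes effect in the following interval. The names `PE`, `next_system`, and `error_trace` are illustrative, and the value 0.5 for matching system and channel indices follows the statement that communication is not possible in that case.

```python
# Nominal cross-over probability for each (system, channel state) pair.
PE = {(1, 1): 0.5, (1, 2): 1e-6, (2, 1): 1e-6, (2, 2): 0.5}


def next_system(system, channel_state):
    """Switch to the other system when the interferer is in our band."""
    return 3 - system if channel_state == system else system


def error_trace(channel_states, start_system=1):
    """Cross-over probability seen in each interval of a channel-state sequence."""
    system = start_system
    trace = []
    for state in channel_states:
        trace.append(PE[(system, state)])      # transmit with the current system
        system = next_system(system, state)    # then update for the next interval
    return trace
```

For the channel-state sequence 1, 1, 2, 2 starting in system 1, the link is jammed only in the interval in which the interferer arrives in the active band.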

We first compute the capacity for system 1 facing the two-state channel. Computational costs limit the number of channel uses in the calculations to eight. The capacity is computed as a function of the transition (cross-over) probabilities for the two system-channel combinations: $p_1 = P_e$ for channel state 1 and $p_2 = P_e$ for channel state 2. The capacity is shown on the left in Figure 15. Clearly, the capacity approaches zero when both $p_1$ and $p_2$ approach 0.5. The computed capacity of the two-system link is shown on the right in Figure 15. Here the transition probabilities for system 2 are $p_2$ for channel state 1 and $p_1$ for channel state 2. A more revealing look at the results is the ratio of the dynamic capacity to the static capacity, shown in Figure 16. Here it is seen that the dynamic capacity can exceed twice the static capacity when $p_1$ is large and $p_2$ is relatively small. For example, for $p_1 = 0.5$ and $p_2 = 10^{-6}$, we have a static capacity of 0.25 bits per channel use and a dynamic capacity of 0.63.

(Figure 14, System Model labels, continued: p(1|1,1) = 0.0, p(1|1,2) = 1.0, p(2|1,1) = 1.0.)

Figure 15: Static and dynamic system capacities for the first case in Example Two. (Panel titles: Static System Facing Dynamic Channel; Dynamic System and Channel.)

Random  Deep  Fade 

In the second part of this example, we again consider binary DMCs and a two-state channel, but in this case the channel alternates between a good channel and a deep-fade channel. When the channel is faded, each of the two binary DMCs experiences the fade, but system 2 is much more tolerant of fades than system 1. The basic idea is that the two systems operate in overlapping frequency bands but use different modulation types (say, coherent BPSK and incoherent BFSK), which represent widely different combinations of power and bandwidth efficiencies.

The  channel  and  system  evolution  state  diagrams  are  provided  in  Figure  17,  and  the  nominal 
transition probabilities for the four channel-system combinations are as follows:

System   Channel   $P_e$
1        1         $10^{-6}$
1        2         0.5
2        1         $10^{-8}$
2        2         0.005

Thus, system 2 is generally superior to system 1 in both channel states, but has some other undesirable qualities such as bandwidth inefficiency. When system 1 encounters a fade, it switches to system 2, which operates until the fade ends.

The capacity for static system 1 and the two-state dynamic channel is shown on the left in Figure 18 as a function of $p_1 = P_e$ for channel state 1 and $p_2 = P_e$ for channel state 2 (for a sequence of eight channel uses). On the other hand, the computed capacity for the dynamic link is shown on the right in Figure 18. Here, the transition probabilities for system 2 are always a factor of 100 smaller than those for system 1. The ratio of dynamic system capacity to static is shown in Figure 19. Note that the ratio is never less than unity and that it gets very large when the transition probabilities for system 1 both approach 0.5. For the nominal probabilities above ($10^{-6}$ and 0.5), we have a static capacity for system 1 of 0.19 and a dynamic capacity of 0.48. When the fade is such that system 1 simply does not work, system 2 still does, and the capacity ratio is very large because the static-system capacity is close to zero.

Figure 16: The static-dynamic capacity ratio for the first case in Example Two.

8.3  Example  3:  Mixed-Rate  Systems  Facing  a  Multi-State  Channel 

In this final example, we study a challenging case involving a rate-adaptive system facing a dynamic channel. The intent of the example is to compare a typical rate-adaptive scheme with an ISP-enabled adaptive scheme.

We  consider  three  possible  bit  rates  corresponding  to  binary,  4-ary,  and  8-ary  modulation  types. 
For  the  typical  rate-adaptive  scheme,  signalling  is  performed  in  a  single  fixed  frequency  band  and 
there  are  four  possible  channel  states,  roughly  corresponding  to  the  SNR  condition  seen  by  the 
current  demodulator:  High,  Moderate,  Low,  and  Blocked.  For  the  ISP  system,  we  simply  perform 
the  rate-adaptation  in  one  of  two  possible  frequency  bands.  When  one  band  becomes  blocked,  the 
system  moves  to  the  other  band.  We  assume  that  the  two  bands  are  never  simultaneously  Blocked. 

Channel-State  Transitions. 

Since  there  are  two  frequency  bands  and  four  possible  states  for  each,  there  are  sixteen  distinct 
channel  states.  We  will  explicitly  rule  out  the  state  in  which  both  bands  are  Blocked,  so  that  the 
total  number  of  possible  states  is  fifteen,  as  defined  in  Table  2. 

To define the channel-state transition probabilities, we first assume that the probability of staying in the current state is high. Also, transitions are allowed only between adjacent states. For example, the Band-1 state can transition between High and Moderate, but not between High and Low or High and Blocked. This assumption effectively imposes a slow-variation constraint on the physical channel. Finally, once a band is Blocked, it stays Blocked for some time.

We assign the same-state probability $p(i|i) = 0.91$. All other allowed transitions are equiprobable for each $i$. For example, for $i = 5$ we have $p(1|5) = p(6|5) = p(9|5) = 0.03$.
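The fifteen-state transition table can be generated mechanically from these rules. The sketch below is one reading of the text: it assumes that an allowed transition moves exactly one band by one level (which reproduces the $i = 5$ example), that states are numbered as in Table 2, and that the excluded Blocked/Blocked state is never a neighbor. The function names are hypothetical.

```python
LEVELS = ["High", "Moderate", "Low", "Blocked"]


def build_states():
    """(band-1 level index, band-2 level index) pairs in Table 2 order."""
    return [(b1, b2) for b1 in range(4) for b2 in range(4)
            if not (b1 == 3 and b2 == 3)]


def transition_probabilities(p_stay=0.91):
    """Return P with P[i][j] = p(j|i) for channel states numbered 1..15."""
    index = {state: n + 1 for n, state in enumerate(build_states())}
    P = {}
    for (b1, b2), i in index.items():
        neighbors = []
        for step in (-1, 1):
            # Assumption: only one band changes level per transition.
            for candidate in ((b1 + step, b2), (b1, b2 + step)):
                if candidate in index:
                    neighbors.append(index[candidate])
        row = {i: p_stay}
        for j in neighbors:
            row[j] = (1.0 - p_stay) / len(neighbors)
        P[i] = row
    return P
```

Under these assumptions state 5 (Moderate, High) has the three neighbors 1, 6, and 9, each with probability 0.03, while corner states such as state 1 split the remaining 0.09 between two neighbors.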
Figure 17: Channel and system state diagrams for the second case in Example Two. (System Model labels recovered from the diagram: p(1|1,1) = 1.0, p(1|1,2) = 0.0, p(2|1,1) = 0.0, p(2|2,1) = 0.0, p(2|2,2) = 1.0, p(1|2,1) = 1.0, p(1|2,2) = 0.0.)

Figure 18: Computed capacities for the links in the second part of Example Two. (Panel titles: Static System Facing Dynamic Channel; Dynamic System and Channel.)

Figure 19: Computed capacity ratio for the second part of Example Two.


State Label   Band 1     Band 2
1             High       High
2             High       Moderate
3             High       Low
4             High       Blocked
5             Moderate   High
6             Moderate   Moderate
7             Moderate   Low
8             Moderate   Blocked
9             Low        High
10            Low        Moderate
11            Low        Low
12            Low        Blocked
13            Blocked    High
14            Blocked    Moderate
15            Blocked    Low

Table 2: Channel-state labels for Example Three.
System Label   Band Label   Modulation M
1              1            2
2              1            4
3              1            8
4              2            2
5              2            4
6              2            8

Table 3: System labels for Example Three.


System  Evolution. 

To  define  the  system-evolution  probabilities,  first  note  that  there  are  six  possible  systems  as  shown 
in  Table  3.  We  need  to  specify  the  conditional  probabilities 

$$P(d_{i+1} = \lambda \mid d_i = \gamma,\ s_i = \alpha),$$

where $d_{i+1}$ and $d_i$ denote the next and current systems, respectively, and $s_i$ denotes the current
channel  state.  Our  system  evolution  is  governed  by  the  following  guidelines.  Whenever  the  current 
state  is  High,  increase  M.  Whenever  it  is  Low,  decrease  M.  When  it  is  Moderate,  decrease  M  if 
M  is  maximum,  increase  if  M  is  minimum.  If  the  current  channel-state  is  Blocked,  move  to  the 
other  frequency  band  and  maintain  M.  This  results  in  a  set  of  system  evolution  probabilities  that 
are  either  unity  or  zero. 
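Because every evolution probability is zero or one, the guidelines amount to a lookup rule. A minimal sketch follows; it assumes that "the current state" refers to the level of the band the system is currently using, that $M$ saturates at its limits (High with $M = 8$ and Low with $M = 2$ leave $M$ unchanged), and that Moderate with $M = 4$ leaves $M$ unchanged. These boundary cases are not spelled out in the text, and the names are illustrative.

```python
# System label -> (band, M), following Table 3.
SYSTEMS = {1: (1, 2), 2: (1, 4), 3: (1, 8), 4: (2, 2), 5: (2, 4), 6: (2, 8)}
LABEL = {value: key for key, value in SYSTEMS.items()}
RATES = [2, 4, 8]


def next_system(system, band_levels):
    """Return the next system label.

    band_levels maps band number to its level, e.g. {1: "High", 2: "Low"}.
    """
    band, M = SYSTEMS[system]
    level = band_levels[band]
    r = RATES.index(M)
    if level == "Blocked":
        band = 3 - band                 # move to the other band, keep M
    elif level == "High":
        r = min(r + 1, len(RATES) - 1)  # increase M (assumed to saturate)
    elif level == "Low":
        r = max(r - 1, 0)               # decrease M (assumed to saturate)
    elif level == "Moderate":
        if r in (0, len(RATES) - 1):
            r = 1                       # move toward the middle rate
    return LABEL[(band, RATES[r])]
```

For instance, system 3 (band 1, $M = 8$) facing a Blocked band 1 moves to system 6 (band 2, $M = 8$).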

Channel-State  Estimator. 

For  this  experiment,  we  assume  that  the  channel  state  is  perfectly  estimated.  An  example  estimator 
involves  measuring  the  tightness  of  the  clusters  of  received  points  in  the  constellation  diagram. 
For very tight clusters, the channel state is estimated as High; for somewhat loose clusters, it is estimated as Moderate; and so on.

DMC Transition Probabilities.

The transition probabilities depend on both the current system and the current channel state. Let us assume that the $M = 2$, 4, and 8 modulation types are well matched to the Low, Moderate, and High SNR levels, respectively. A mismatch between the value of $M$ and the channel state multiplies the BER by a penalty factor of 100.0 or a reward factor of 0.01, depending on the orientation of the mismatch. The nominal BER for matched situations is 1.0e-4. Thus, when the channel state is Moderate and $M = 4$, the BER is assumed to be 1.0e-4. On the other hand, when the channel state is Moderate and $M = 2$, the BER is 1.0e-6.
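The BER assignment can be written as a small function. This is a sketch under two assumptions that the text does not settle: the same factor is applied regardless of how many levels separate $M$ from the matched value, and a Blocked band gives a BER of 0.5 (the value stated for the Non-ISP system). The names are hypothetical.

```python
# Modulation order that is matched to each unblocked channel level.
MATCHED_M = {"Low": 2, "Moderate": 4, "High": 8}


def ber(M, level, nominal=1.0e-4, penalty=100.0, reward=0.01):
    """Bit-error rate for modulation order M in a band at the given level."""
    if level == "Blocked":
        return 0.5                  # assumed: no useful communication
    matched = MATCHED_M[level]
    if M > matched:
        return nominal * penalty    # too ambitious for the channel
    if M < matched:
        return nominal * reward     # more conservative than necessary
    return nominal
```

With the default arguments, the Moderate channel gives 1.0e-4 for $M = 4$ and 1.0e-6 for $M = 2$, in agreement with the two cases quoted above.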

Non-ISP  Rate-Adaptive  System. 

For  the  Non-ISP  rate-adaptive  system,  only  one  frequency  band  is  available.  Thus,  there  are  only 
four  channel  states  and  three  systems.  If  the  band  becomes  Blocked,  communication  takes  place 
at  a  BER  of  0.5  for  all  values  of  M. 

Results  for  Example  Three. 

For this example, the computational burden is quite high since the dimension of the largest system is eight and the channel has many states. Therefore, we are able to compute the exact capacity only for a single set of parameters (as described) and for a sequence of at most five channel uses. In this case, the computed capacities are as shown in Table 4. Many variations are of interest here, but this single case reveals that even for mild channel-state evolutions, the benefits of ISP-enabled modulation adaptation are quite large. We hope to expand upon this example in future work.

           Channel Sequence Length N
           2      3      4      5
Non-ISP    0.94   1.07   1.11   1.12
ISP        1.63   1.87   1.97   2.00

Table 4: Capacity results for Example Three.


9  Conclusions 

The general problem of integrating environment sensing with communication-system processing is addressed in this report. Traditional communication-system design focuses on disjoint optimization of the canonical processing blocks in a communication system. The system is deployed and has little ability to adapt to unforeseen conditions. In the present work, the overarching concept is of a communication system that can sense and assess its environment (channel) and use this information to make appropriate changes to one or more system parameters. Such a system can tolerate degraded conditions and take advantage of improved conditions. Our first goal in developing this system notion is the establishment of an abstract system model. The second goal is computation of channel capacities for the model. Progress toward these two fundamental goals is documented herein. Further work will be aimed at development and evaluation of engineering solutions that take advantage of the increased capacity of the new system model.
Appendices

A Definitions of Capacity

In this appendix we provide a brief overview of capacity definitions. There is no single universally applicable definition, although the basic idea in all cases is the maximization of average mutual information. We review capacity for discrete memoryless channels (DMCs), discrete finite-state channels (FSCs), discrete-time memoryless channels, and waveform channels. The material in this appendix is based on Gallager [2].

A.1 Discrete Memoryless Channels

The discrete memoryless channel (DMC) is defined by a finite input alphabet, a finite output alphabet, the prior probabilities on the input alphabet, and the transition probabilities, as shown in Figure 20. The transition probability $p_{j|k}$ is the probability of receiving $y_j$ given that $x_k$ is transmitted.

Figure 20: General discrete memoryless channel definition. (Diagram labels: Input Alphabet with Prior Probabilities $(q_1), (q_2), (q_3), \ldots$; Transition Probabilities; Output Alphabet $y_1, y_2, y_3, \ldots, y_J$.)

The capacity of a DMC is the maximum average mutual information between the input and output, where the maximum is over the prior probabilities,

$$C = \max_Q I(X;Y) = \max_Q I(Y;X), \qquad (1)$$

where


(2) 


For  this  relatively  simple  channel,  the  average  mutual  information  is  easy  to  compute  and  we  obtain 


(3) 
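For a DMC that is not symmetric, the maximization over the priors generally has to be done numerically. The sketch below uses the Blahut-Arimoto iteration, a standard algorithm chosen here for illustration and not necessarily the method used in the report's software, together with the textbook expression $I(X;Y) = \sum_{j,k} q_k\, p_{j|k} \log\big(p_{j|k} / \sum_i q_i\, p_{j|i}\big)$. The function name and the layout `P[k][j]` = $p_{j|k}$ are assumptions.

```python
import math


def blahut_arimoto(P, tol=1e-12, max_iter=10000):
    """Capacity (bits per channel use) and optimizing prior of a DMC.

    P[k][j] is the probability of receiving y_j given that x_k is sent.
    """
    K, J = len(P), len(P[0])
    q = [1.0 / K] * K                       # start from a uniform prior
    lower = 0.0
    for _ in range(max_iter):
        # Output distribution induced by the current prior.
        r = [sum(q[k] * P[k][j] for k in range(K)) for j in range(J)]
        # c[k] = exp of the divergence between row k and the output law.
        c = []
        for k in range(K):
            d = sum(P[k][j] * math.log(P[k][j] / r[j])
                    for j in range(J) if P[k][j] > 0.0)
            c.append(math.exp(d))
        z = sum(q[k] * c[k] for k in range(K))
        lower, upper = math.log(z), math.log(max(c))   # capacity bounds in nats
        q = [q[k] * c[k] / z for k in range(K)]
        if upper - lower < tol:
            break
    return lower / math.log(2.0), q
```

As a usage example, `blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])` should return the binary symmetric channel capacity $1 - H_2(0.1)$ together with a uniform prior, while an asymmetric channel such as the Z-channel yields a non-uniform prior.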


A.2  Finite-State  Channels 

Here we have a discrete channel with a form of memory. The channel can take on one of $A$ states in each channel-use period. The time-varying channel gives rise to the notion of computing mutual information over the first $N$ channel uses, finding the capacity by maximizing this quantity, and finally letting the number of channel uses increase without bound. The maximization is generally over the prior probabilities for each of the channel uses. For special channels, such as decomposable channels in which a state can be reached from another state but never left, the priors may indeed need to be different over the different channel-use periods in order to maximize mutual information.

The channel here is modeled as a first-order Markov process. For $N$ channel uses we have the input sequence $x_1, \ldots, x_N$ and the output sequence $y_1, \ldots, y_N$. Let the vectors $x$ and $y$ denote the input and output sequences, respectively. For each possible value $u$ of $x$ we have a probability measure $Q_N(u)$. The probability measure on the output sequence is connected to the prior probabilities and the probability structure of the channel. Let the $i$th channel state be represented by the random variable $S_i$, which can take on one of $A$ states $s_1, s_2, \ldots, s_A$. By exploiting the Markov channel structure, we can form the conditional output probability given by


and 


This  type  of  conditional  density  is  necessary  since  the  channel  state  is  allowed  to  depend  on  the 
previous  inputs,  as  required,  for  example,  for  simple  modeling  of  inter-symbol  interference. 

The mutual information, conditional on $S_0$, can now be determined using $Q_N$ and the conditional output density. Two types of capacity are defined, corresponding to the best- and worst-case values of $S_0$:

$$\overline{C} = \lim_{N \to \infty} \overline{C}_N, \qquad (4)$$

where

$$\overline{C}_N = \frac{1}{N} \max_{Q_N} \max_{S_0} I_Q(X^N; Y^N \mid S_0),$$

$$I_Q(X^N; Y^N \mid S_0) = \sum_{x} \sum_{y} Q_N(x) P_N(y \mid x, S_0) \log \frac{P_N(y \mid x, S_0)}{\sum_{x'} Q_N(x') P_N(y \mid x', S_0)},$$

and $Q_N$ is the collection of all the probabilities $Q_N(x)$. For the alternate capacity $\underline{C}_N$, replace the maximum over $S_0$ with a minimum.

A.3  Discrete-Time  Memoryless  Channels 

Here the channel input and output are continuous random variables, so the alphabets can be infinite. The input letters are still applied in succession, so we retain the discrete-time nature of the channel. The basic analysis concept for such channels is to choose a finite subset of the input alphabet and to partition the output space into a finite collection of subsets. Let the input alphabet be $x_1, \ldots, x_K$ with prior probabilities $q_1(x_1), \ldots, q_K(x_K)$, and partition the output space into $J$ mutually exclusive events that exhaust the space. Let these events be denoted by $y_1, \ldots, y_J$. The average mutual information follows as

$$I(X;Y) = \sum_{j,k} q_k(x_k) P_{Y|X}(y_j \mid x_k) \log \frac{P_{Y|X}(y_j \mid x_k)}{\sum_{i} q_i(x_i) P_{Y|X}(y_j \mid x_i)}, \qquad (5)$$

and the capacity is

$$C = \sup I(X;Y), \qquad (6)$$

where the supremum is over all finite selections of the $x_k$, all priors $q_k(x_k)$, and all output-space partitions $y_j$. A difficulty with this analysis is that the input letters are not constrained in amplitude, and therefore any particular noise level (resulting in the transition probabilities for the channel) can be overcome by simply choosing an input alphabet with very large elements. Therefore, a constraint is often imposed on the input alphabet. Let the constraint function be a real-valued function $f(\cdot)$. Then the input-constrained capacity is given by (5) and (6) with the supremum over all partitions of the output space, all finite selections of the $x_k$, and all priors $q_k$ such that

$$\sum_{k=1}^{K} q_k(x_k) f(x_k) \le E.$$

For example, $f(x) = x^2$ represents an average energy constraint, and $f(x) = |x|$ represents an amplitude constraint.


A.3.1  Discrete-Time  Memoryless  AWGN  Channel 

A special case of the general discrete-time memoryless channel is the discrete-time memoryless channel with additive white Gaussian noise (AWGN). Let the channel noise have zero mean and variance $\sigma^2$ and impose an average-energy constraint

$$\sum_{k=1}^{K} q_k(x_k)\, x_k^2 \le E.$$

Then the capacity is given by

$$C = \frac{1}{2} \log\left(1 + \frac{E}{\sigma^2}\right). \qquad (7)$$
A.4 The Waveform Channel

For these channels, the input and output are functions of time, typically constrained to the class of $L_2$ functions on the interval $[0, T]$ (square-integrable functions). To make use of the previously described discrete-alphabet discrete-time machinery, and to simplify the probabilistic analysis, we represent each function as a (possibly infinite) expansion onto a complete orthonormal set of functions. The set of coefficients for a function is then used to specify the function, which allows reasonable and tractable definitions of mutual information and capacity for the waveform channel.

Let $x(t)$ and $y(t)$ denote the input and output channel waveforms, and let $\{\phi_n(t)\}$ denote the orthonormal set. Then the expansion coefficients for $x(t)$ are

$$x_n = \int x(t)\, \phi_n^*(t)\, dt = \int x(t)\, \phi_n(t)\, dt,$$

where

$$x(t) = \sum_n x_n \phi_n(t),$$
and similarly for $y_n$. Furthermore, let $x = [x_1, \ldots, x_N]$ and $y = [y_1, \ldots, y_N]$. Suppose that the conditional probability densities $p_N(y \mid x)$ exist for all finite $N$. Then these probabilities play the role of channel transition probabilities. Let the prior probability density be denoted by $q_N(x)$. Then the mutual information between channel input and output is given by

$$I_T(x(t); y(t)) = \operatorname*{l.i.m.}_{N \to \infty} I(x; y),$$

where

$$I(x; y) = \log \frac{p_N(y \mid x)}{\int q_N(x_1)\, p_N(y \mid x_1)\, dx_1}.$$

The average mutual information is the limiting version of the expected value of the mutual information,

$$I_T(X(t); Y(t)) = \lim_{N \to \infty} E[I(x; y)] = \lim_{N \to \infty} I(X; Y).$$

The capacity is then defined as

$$C = \lim_{T \to \infty} \frac{1}{T} \sup I(X; Y), \qquad (8)$$


where the supremum is over all prior density functions consistent with any channel-input constraints.

A.4.1  AWGN  Waveform  Channel 

Let the output of a waveform channel be the sum of the input and WGN with spectral density $N_0/2$. Apply an input-power constraint of $S$, and let the input be duration-limited to an interval of length $T$ and have approximate bandwidth $W$. Then the capacity per unit time is given by

$$C = W \log_2\left(1 + \frac{S}{W N_0}\right) \ \text{bits/sec}, \qquad (9)$$

which is Shannon's most famous channel-capacity result.
A.5 Discussion

Presumably the waveform-channel capacity should never be less than the DMC or discrete-time memoryless channel capacities when they are compared properly by constraining the bandwidth and symbol/waveform duration. This conclusion is due to the lack of imposed system structure of the waveform channel with respect to the other channel types; the waveform channel can always be used as a discrete-time channel.

For a DMC with $K$ letters in its input alphabet, the capacity for one channel use every $T$ seconds (assuredly achieved at infinite SNR) is simply

$$C_{DMC} = \frac{\log_2 K}{T} \ \text{bits/sec}.$$

For example, for the BSC, $K = 2$ and

$$C_{BSC} = \frac{\log_2 2}{T} = \frac{1}{T} \ \text{bits/sec}.$$

On the other hand, for the waveform channel and any SNR, we have

$$C_{WC} = W \log_2\left(1 + \frac{S}{W N_0}\right) \ \text{bits/sec}.$$

Constraining the bandwidth used for the DMC to be about $1/T = W$, we have

$$C_{DMC} = W \log_2 K \ \text{bits/sec}.$$

Clearly,  the  waveform  channel  capacity  can  be  indefinitely  increased  by  increasing  the  SNR, 
whereas  the  DMC  capacity  cannot.  Even  for  a  fixed  SNR,  the  waveform  channel  capacity  can 
be  much  larger  than  that  for  a  given  DMC. 
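The comparison can be made concrete with a few lines of Python. The sketch evaluates the two capacity expressions above for a common bandwidth $W$ and writes the SNR as the single quantity $S/(W N_0)$; setting the two expressions equal gives the crossover SNR $K - 1$. The function names are illustrative.

```python
import math


def dmc_capacity(W, K):
    """Best-case capacity (bits/sec) of a K-ary DMC used W times per second."""
    return W * math.log2(K)


def waveform_capacity(W, snr):
    """AWGN waveform-channel capacity (bits/sec), with snr = S / (W * N0)."""
    return W * math.log2(1.0 + snr)


def crossover_snr(K):
    """SNR above which the waveform channel exceeds the K-ary DMC bound."""
    return K - 1.0
```

For a binary alphabet the crossover is at an SNR of 1 (0 dB); above that value the waveform channel capacity exceeds the one bit per channel use that any binary DMC can deliver.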
Abstract

Historically,  transform  coding  of  noisy  sources  has  been  performed  by  first  estimating  the  message  and  then 
quantizing  this  estimate.  We  show  that  much  insight  can  be  gained  by  recognizing  that  it  is  also  optimum  to 
first  transform  the  noisy  observations  into  canonical  coordinates,  quantize,  apply  a  Wiener  filter  in  this  coordinate 
system,  and  then  transform  the  result  back  to  the  original  coordinates.  Canonical  coordinates  are  uncorrelated,  and 
quantization  and  Wiener  filtering  are  applied  to  each  component  independently.  Optimality  of  this  approach  can  be 
proved  assuming  additive  white  quantization  noise.  Half  canonical  coordinates  minimize  the  mean-squared  error 
by  minimizing  the  trace  of  the  error  covariance  matrix  and  full  canonical  coordinates  maximize  information  rate  by 
minimizing  the  determinant  of  the  error  covariance  matrix.  We  also  demonstrate  in  this  paper  that  majorization  is 
the  fundamental  principle  underlying  proofs  of  optimal  transform  coding,  sometimes  in  a  very  direct,  sometimes  in 
a  more  indirect  way. 

Keywords 

Transform  Coding,  Quantization,  Canonical  Coordinates,  Rank  Reduction,  Majorization 

I.  Introduction 

In this paper we are interested in the following problem: Given a finite bit-budget of $B$ bits, how can we most efficiently represent the information that a noisy observation $\mathbf{y} \in \mathbb{R}^n$ contains about a random message $\mathbf{x} \in \mathbb{R}^m$? The apparatus that we will have at our disposal is depicted in Fig. 1. First, the observation $\mathbf{y}$ is passed through a linear transformation $\mathbf{A} \in \mathbb{R}^{m \times n}$, which we call the coder. The output of the coder is $\mathbf{u} = \mathbf{A}\mathbf{y}$, which is subsequently processed by a scalar quantizer. That is, each component of $\mathbf{u}$ is independently quantized. The quantizer output $\hat{\mathbf{u}}$ is supposed to be an efficient representation of the message $\mathbf{x}$, not the measurement $\mathbf{y}$. To produce an estimate $\hat{\mathbf{x}} = \mathbf{B}\hat{\mathbf{u}}$ of the message $\mathbf{x}$, $\hat{\mathbf{u}}$ is linearly transformed by the decoder $\mathbf{B} \in \mathbb{R}^{m \times m}$. Without loss of generality, we suppose that $m \le n$. Furthermore, we will assume that $\mathbf{x}$ and $\mathbf{y}$ have zero mean and that we have the necessary second-order information available, namely, the covariance matrices of $\mathbf{x}$ and $\mathbf{y}$, denoted by $\mathbf{R}_{xx} = \mathrm{E}\,\mathbf{x}\mathbf{x}^T$ and $\mathbf{R}_{yy} = \mathrm{E}\,\mathbf{y}\mathbf{y}^T$, respectively, and the cross-covariance matrix $\mathbf{R}_{xy} = \mathrm{E}\,\mathbf{x}\mathbf{y}^T$. For an adaptive implementation of our results we would assemble $M$ independent snapshots of $[\mathbf{x}, \mathbf{y}]$ into matrices $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_M]$ and $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_M]$. The covariance matrices $\mathbf{R}_{xx}$, $\mathbf{R}_{xy}$, and $\mathbf{R}_{yy}$ could then be estimated as $M^{-1}\mathbf{X}\mathbf{X}^T$, $M^{-1}\mathbf{X}\mathbf{Y}^T$, and $M^{-1}\mathbf{Y}\mathbf{Y}^T$.
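The snapshot estimates can be sketched in plain Python as follows. Snapshots are stored here as lists of vectors (one list entry per column of $\mathbf{X}$ or $\mathbf{Y}$), and zero-mean data is assumed, as in the text. The function name is hypothetical.

```python
def sample_cross_covariance(X, Y):
    """Estimate M^{-1} X Y^T from M paired snapshots.

    X is a list of M vectors of length m, Y a list of M vectors of length n.
    Passing the same list twice gives the covariance estimate M^{-1} X X^T.
    """
    M = len(X)
    if M == 0 or len(Y) != M:
        raise ValueError("X and Y must hold the same nonzero number of snapshots")
    m, n = len(X[0]), len(Y[0])
    R = [[0.0] * n for _ in range(m)]
    for x, y in zip(X, Y):
        for i in range(m):
            for j in range(n):
                R[i][j] += x[i] * y[j] / M   # accumulate the outer product x y^T
    return R
```

Calling `sample_cross_covariance(X, X)`, `sample_cross_covariance(X, Y)`, and `sample_cross_covariance(Y, Y)` gives the three estimates named above.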

The problem can now be re-formulated as follows: First, how do we choose $\mathbf{A}$ and $\mathbf{B}$, i.e., in what coordinate system should we quantize? Second, how do we distribute the total number of bits $B$ over the components of $\mathbf{u}$ so that $\hat{\mathbf{x}}$ is a good estimate of $\mathbf{x}$? To make precise what we mean by a “good” estimate we will employ two different performance measures: $E = \operatorname{tr} \mathbf{R}_{ee} = \operatorname{tr} \mathrm{E}\,\mathbf{e}\mathbf{e}^T = \mathrm{E}\,\|\mathbf{x} - \hat{\mathbf{x}}\|^2$, which is the mean squared error (MSE), and $V = \det \mathbf{R}_{ee}$, which measures the volume of the error covariance ellipsoid and thus information rate in the Gaussian case. For simplicity, let us refer to the problems where we try to minimize $E$ and $V$ as the min-trace and min-det problems, respectively.

SCHREIER AND SCHARF: CANONICAL COORDINATES FOR TRANSFORM CODING (May 27, 2003)

                 assumptions       Karhunen-Loeve Transform
Paper            AWN     Gauss     A = B^-1    A^T = A^-1    u uncorr.
[1]                      ■         √           √             ■
[2, App. I]              ■         ■           ■             √
[3, Ch. 8.6]     ■†      ■         ■           ■             √
[4, App.]        ■                 ■           √wlog         √
This paper       ■                 √           √wlog         √

TABLE I
A SHORT HISTORY OF TRANSFORM CODING - WHAT HAS BEEN ASSUMED (■) AND WHAT HAS BEEN PROVED (√); √wlog = PROVED THAT IT IS POSSIBLE TO CHOOSE A AS ORTHOGONAL WITHOUT LOSS OF GENERALITY; ■† = ONLY HIGH-RESOLUTION ASSUMPTION IS USED


Fig. 1. Transform Coder


A.  Historical  Overview 

Noise-free transform coding: The arrangement in Fig. 1 is commonly known as a transform coder. Most transform coders considered in the literature work on a noiseless measurement $\mathbf{y} = \mathbf{x}$. Given the eigenvalue decomposition $\mathbf{R}_{xx} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, the orthogonal matrix $\mathbf{U}$ is often referred to as the Karhunen-Loeve Transform (KLT) corresponding to $\mathbf{x}$. A common claim is the following: “The KLT is optimum for noise-free transform coding, meaning that the choice $\mathbf{A} = \mathbf{U}^T$ and $\mathbf{B} = \mathbf{U}$ (together with a suitable bit assignment strategy) minimizes the mean squared error $E$.”

There are, however, important caveats regarding this claim. They concern the underlying assumptions that have been made in order to prove it. Table I gives a short history of results for noise-free transform coding. Proving optimality of the KLT really means establishing three properties: $\mathbf{A} = \mathbf{B}^{-1}$, $\mathbf{A}$ is orthogonal, and the quantizer input $\mathbf{u}$ is uncorrelated. However, as can be inferred from Table I, with the exception of this paper, one or two of the KLT’s properties have always been assumed rather than actually proved.

The most important classification of a proof is according to whether or not it utilizes the high-resolution assumption. High-resolution means that the bit-budget $B$ is asymptotically large and fine quantizers are employed. If some additional smoothness constraints are made, then quantization noise can be modeled as additive white noise, which is independent of the input signal [5]. This leads to the additive white noise (AWN) model for quantization [6], which we will discuss in Section II-A. Thus, the AWN model implies high-resolution, but not vice versa. Without the use of the AWN model, restrictive assumptions must be made as in proofs [1], [2], [3]. This should come as no surprise. Since a quantizer is an inherently non-linear device, we should not expect the solution to the min-trace problem to be the KLT, a result from linear algebra, unless we have linearized the quantizer through the use of the AWN model, or we have made the problem statement so specific that we basically force the solution to be the KLT. In the paper that introduced transform coding, Huang and Schultheiss [1] assumed uncorrelated quantizer input and then proved that, in order to minimize $E$, $\mathbf{A}$ must be chosen as $\mathbf{B}^{-1}$ and $\mathbf{A}$ must be orthogonal. Intuition might suggest that requiring uncorrelated quantizer inputs is the right thing to do. Strictly speaking, however, [1] does not prove the optimality of the KLT because its central property is anticipated. Requiring $\mathbf{A} = \mathbf{B}^{-1}$, as [2], [3], [4] do, is more reasonable, because this guarantees $E \to 0$ as $B \to \infty$. The additional assumption of orthogonal $\mathbf{A}$ in [2], [3], however, is again restrictive, and we will show in this paper that for performance measures other than MSE orthogonal transforms can be outperformed by non-orthogonal transforms.

We  should  also  mention  that  without  the  use  of  the  AWN  model,  optimality  of  the  KLT  can  only  be 
proved  for  Gaussian  input.  In  fact,  Zeger  [7]  has  recently  shown  that,  even  in  the  high-resolution  case, 
KLTs  can  be  strictly  sub-optimum  for  transform  coding  if  the  input  data  is  non-Gaussian.  Thus,  in  the 
table,  the  Gaussian  assumption  can  only  be  dropped  when  the  AWN  model  is  employed. 

In  short,  only  a  linearized  version  of  this  problem,  which  uses  the  AWN  model,  yields  the  KLT  as  the 
general  solution.  Otherwise,  if  the  optimality  of  the  KLT  can  be  shown  for  particular  assumptions,  this 
is  more  an  indication  that  the  problem  was  cleverly  posed  rather  than  evidence  that  the  KLT  is  indeed 
optimum  for  transform  coding  in  general. 

Transform coding of noisy sources: If the observations are not equal to the message, then it has been
shown in [8], [9], that for the min-trace problem it is optimum to first find the MMSE estimate of the
message and then quantize this estimate. This result has been extended in [10] to more general performance
measures, which include weighted MSE, and it has been shown in [11] that a particular weighting
produces a solution to the min-det problem. This establishes that the min-det problem can also be decomposed
into estimator and quantizer. However, as we will demonstrate in this paper, it is not necessary and
not necessarily advantageous to estimate first and then quantize.

B. Contribution of This Paper

In this paper, we are concerned with transform coding of random sources from noisy observations. We
extend known results in the following ways:

• We show that a possible solution to the min-trace and the min-det problem is to first transform the noisy
observations into a half canonical [12, p. 330] or full canonical coordinate system [13] - [17], respectively,
quantize, Wiener filter in this coordinate system, and then transform the result back to the original
coordinates. Canonical coordinates are uncorrelated, which means quantization and Wiener filtering are
applied to each component independently. This extends [8], [9], [10] in that it provides a concrete
coordinate system for quantization. Moreover, our results show that transform coders have many different
implementations: for example, there are implementations where quantization precedes estimation, and
vice versa.

•  We  generalize  Table  I  to  the  noisy  case,  giving  a  proof  that  invokes  the  AWN  model  for  quantization, 
but  does  not  make  additional  assumptions  regarding  the  transformations  A  and  B.  Moreover,  previous 
proofs  in  Table  I  only  consider  the  min-trace  problem,  but  we  solve  the  min-det  problem  as  well. 

•  We  demonstrate  that  majorization  is  the  fundamental  principle  underlying  proofs  of  optimal  transform 
coding,  sometimes  in  a  very  direct,  sometimes  in  a  more  indirect  way. 

•  We  establish  an  important  connection  between  quantization  and  rank  reduction:  It  has  been  shown 
in  [11],  [12,  p.  330]  that  we  should  use  half  or  full  canonical  coordinates  for  rank  reduction,  as  well. 
From  a  quantization  point  of  view,  rank  reduction  means  assigning  infinitely  many  bits  to  a  number  of 
components  and  zero  bits  to  the  remaining  components,  which  is  sometimes  also  called  zonal  sampling. 
Together  with  our  results,  this  means  that  we  can  first  choose  a  coordinate  system  and  then  decide  how 
many  bits  to  spend  on  how  many  components. 

Our program for this paper is as follows: In Section II-A we give a short introduction to quantization
using the AWN model, and in Section II-B we present a concise overview of some majorization results.
Sections III and IV prove that, under the min-trace and min-det criterion, the right coordinate systems in
which to perform quantization are half and full canonical coordinates, respectively. Finally, Section V
looks at several different implementations of rank reduction and quantization in canonical coordinates.
Each realization has its benefits and brings its own insights.

II.  Prerequisites 

A.  Quantization 

A quantizer can always be modeled as an additive noise source, meaning that the quantizer output
$\hat{u} = u + q$ is equal to the quantizer input $u$ plus quantization noise $q$. However, an MMSE Lloyd-Max
quantizer [3, Ch. 6.2] only guarantees that $E q_i = 0$, $E \hat{u}_i q_i = 0$, and $E \hat{u}_i u_i = E \hat{u}_i^2$. In general,
cross-terms such as $E q_i q_j$ and $E u_i q_j$ are non-zero and given by complicated expressions. In order to
make the quantization problem analytically tractable, it is common to employ the additive white noise
(AWN) model. It is based on the high-resolution assumption (fine quantizers with a large number of bits)
and additional smoothness constraints [5], [6]. If we let $b_i$ denote the number of bits for quantizing
component $u_i$, and $\sigma_{u_i}^2 = E u_i^2$ the variance of $u_i$, then the main assumptions of the AWN model may be
summarized as follows:

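$$E\,qq^T = \operatorname{diag}(\sigma_{q_1}^2, \dots, \sigma_{q_m}^2) \tag{1}$$

$$\sigma_{q_i}^2 = E q_i^2 = c\,\sigma_{u_i}^2\,2^{-2b_i}, \quad i = 1, \dots, m \tag{2}$$

$$E\,uq^T = 0 \tag{3}$$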

The constant c is dependent on the distribution of $u_i$. If $u_i$ is zero-mean Gaussian, then $c = \sqrt{3}\pi/2$ [3,
Ch. 8.2]. The advantage of the AWN model is that a quantizer is modeled as an additive white noise
source that is uncorrelated with the input signal. Thus, the quantizer, an inherently non-linear device,
has been linearized.

Property (2) is a consequence of the high-resolution assumption. Without invoking the high-resolution
assumption, we can model the variance of $q_i$ more generally as

$$E q_i^2 = \sigma_{u_i}^2 f(b_i) \tag{4}$$

in the Gaussian case. Here, $f(b_i)$ is a non-increasing function of the number of bits spent on component
$u_i$. Clearly, we expect to get better performance by increasing $b_i$.
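As a rough numerical illustration of property (2), the following sketch evaluates the AWN noise variance for a few components. The input variances and bit counts are hypothetical values chosen for the example, and the function name is not taken from the paper.

```python
import math

# Constant c for a zero-mean Gaussian component, as quoted in the text.
C_GAUSS = math.sqrt(3.0) * math.pi / 2.0


def awn_noise_variance(var_u, bits, c=C_GAUSS):
    """Quantization-noise variance under the AWN model, property (2)."""
    return c * var_u * 2.0 ** (-2.0 * bits)


variances = [4.0, 1.0, 0.25]  # hypothetical input variances
bits = [3, 2, 1]              # hypothetical bit assignment
noise = [awn_noise_variance(v, b) for v, b in zip(variances, bits)]
```

Under this model each additional bit spent on a component lowers its noise variance by a factor of four, which is the behavior the bit assignment results in Section III rely on.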

B. Majorization and Schur-Convex Functions

As suggested in the Introduction, majorization plays a central role in quantization. In this section we
introduce the concept.

Definition 1 (Majorization) [18, p. 7] A real n x 1 vector x is said to be majorized by a real n x 1
vector y, written as $x \prec y$, if

$$\sum_{i=1}^{k} x_{[i]} \le \sum_{i=1}^{k} y_{[i]}, \quad k = 1, \dots, n-1 \tag{5}$$

$$\sum_{i=1}^{n} x_{[i]} = \sum_{i=1}^{n} y_{[i]} \tag{6}$$

where $[\cdot]$ is a permutation operator such that $x_{[1]} \ge \dots \ge x_{[n]}$.

Intuitively, if $x \prec y$, then the components of x are “less spread out” or “more equal” than the components
of y. Note that majorization is sometimes also defined with respect to a permutation operator that arranges
the components of x in increasing order [19, Def. 4.3.24], $x_{[1]} \le \dots \le x_{[n]}$.
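Definition 1 can be checked mechanically. The sketch below is one minimal way to test conditions (5) and (6) for two real vectors; the function name and the numerical tolerance are assumptions made for this illustration.

```python
def is_majorized_by(x, y, tol=1e-12):
    """Return True if x is majorized by y in the sense of Definition 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    xs = sorted(x, reverse=True)  # components in decreasing order
    ys = sorted(y, reverse=True)
    partial_x = partial_y = 0.0
    for k in range(len(xs) - 1):  # condition (5), k = 1, ..., n-1
        partial_x += xs[k]
        partial_y += ys[k]
        if partial_x > partial_y + tol:
            return False
    return abs(sum(xs) - sum(ys)) <= tol  # condition (6)


equal_vs_spread = is_majorized_by([2, 2, 2], [3, 2, 1])  # "more equal" vector on the left
spread_vs_equal = is_majorized_by([3, 2, 1], [2, 2, 2])
```

Here the vector with equal components is majorized by the more spread out one, but not the other way around.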

The  idea  of  majorization  becomes  most  powerful  when  it  is  combined  with  the  concept  of  Schur- 
convexity.  Functions  that  are  Schur-convex  preserve  the  partial  ordering  of  majorization: 

Definition 2 (Schur-convex function) [18, Def. 3.A.1] A real-valued function g defined on a set $D \subset \mathbb{R}^n$
is said to be Schur-convex on D if $x \prec y$ on D implies that $g(x) \le g(y)$. Similarly, a function is
called Schur-concave on D if $x \prec y$ on D implies that $g(x) \ge g(y)$.

Schur-convex functions are necessarily symmetric when they are defined on $\mathbb{R}^n$. However, functions
that are not symmetric on $\mathbb{R}^n$ can still be Schur-convex on the set of ordered n-tuples, which we define
as

$$\mathcal{D}_n = \{(x_1, \dots, x_n) : x_1 \ge \dots \ge x_n\}. \tag{7}$$

To prove that a function is Schur-convex or Schur-concave, there are a number of results, which can be
found in [18, Ch. 3]. We will use the following proposition:

Proposition 1: [18, 3.H.2] Let $g(x) = \sum_{i=1}^{n} h_i(x_i)$, $x \in \mathcal{D}_n$, where each $h_i : \mathbb{R} \to \mathbb{R}$ is differentiable.
Then g is Schur-convex on $\mathcal{D}_n$ if and only if

$$h_i'(a) \ge h_{i+1}'(b) \quad \text{whenever } a \ge b, \quad i = 1, \dots, n-1 \tag{8}$$

where $h_i'(a)$ denotes the first derivative of $h_i$ evaluated at a. Analogously, if $h_i'(a) \le h_{i+1}'(b)$ whenever
$a \ge b$, $i = 1, \dots, n-1$, then g is Schur-concave.

A classical result of majorization is that if H is an n x n Hermitian matrix with diagonal elements
$\operatorname{diag}(H) = (H_{11}, \dots, H_{nn})^T$ and eigenvalues $\operatorname{ev}(H) = (\lambda_1, \dots, \lambda_n)^T$, then [18, Ch. 9.B]

$$\operatorname{diag}(H) \prec \operatorname{ev}(H). \tag{9}$$

Since $g(x) = \prod_{i=1}^{n} x_i$ is a Schur-concave function [18, 3.F.1], this immediately proves Hadamard's
inequality:

$$\prod_{i=1}^{n} H_{ii} \ge \det H. \tag{10}$$

Many other inequalities such as the arithmetic mean/geometric mean (AM/GM) inequality or Minkowski’s
inequality can be viewed as consequences of majorization, as well [18].
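Results (9) and (10) are easy to verify numerically in the smallest non-trivial case. The sketch below uses a hypothetical real symmetric, positive definite 2 x 2 matrix and the closed-form eigenvalues of such a matrix; the matrix entries are illustrative values only.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean + radius, mean - radius


a, b, d = 2.0, 1.0, 1.0  # hypothetical positive definite matrix [[2, 1], [1, 1]]
lam1, lam2 = eig_sym_2x2(a, b, d)
det = a * d - b * b

# (9): the largest diagonal element does not exceed the largest eigenvalue,
# and both vectors have the same sum (the trace).
diag_majorized = max(a, d) <= lam1 and abs((a + d) - (lam1 + lam2)) < 1e-12
# (10): Hadamard's inequality.
hadamard_holds = a * d >= det
```

Equality in (10) holds exactly when the off-diagonal entry b is zero, which is the property the optimality proofs in Sections III and IV exploit.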

III.  Half  Canonical  Coordinates  Solve  the  min-trace  or  MMSE  Problem 

In this section, we show that for the min-trace problem the right coordinate system for quantization is
the system of half canonical coordinates [12, p. 330]. We will provide two proofs: one that is based on the
AWN model, and one that starts with more restrictive assumptions, but does not use the high-resolution
assumption. We will demonstrate how majorization is the underlying principle for both proofs.

We refer to the notation introduced in Fig. 1. The starting point for both proofs is the error vector
$e = \hat{x} - x$, which, without any additional assumptions, is given by

$$e = (BAy - x) + Bq. \tag{11}$$


A.  Additive  White  Noise  Model 

If we invoke the AWN model, then $E\,xq^T = 0$ and $E\,yq^T = 0$, and the error covariance matrix
$R_{ee} = E\,ee^T$ becomes

$$R_{ee} = E\left[(BAy - x)(BAy - x)^T\right] + B R_{qq} B^T, \tag{12}$$

which can be expressed as the sum of three positive semi-definite terms:

$$R_{ee} = Q + (W - BA) R_{yy} (W - BA)^T + B R_{qq} B^T. \tag{13}$$

In this equation, $W = R_{xy} R_{yy}^{-1}$ is the Wiener filter and $Q = R_{xx} - R_{xy} R_{yy}^{-1} R_{xy}^T$ is its filtering error
covariance matrix. It is clear that we can make the middle term in (13) zero if we choose $BA = W$, and
we will assume this optimum choice in what follows. Thus the infinite precision quantizer is a Wiener
filter with error covariance Q. Since Q does not depend on how we select A and B, minimizing $\operatorname{tr} R_{ee}$
amounts to minimizing $\operatorname{tr} B R_{qq} B^T$. This can be achieved with a variation on the proof for the noiseless
case in [4, Appendix]. Denote the i-th column of B by $b_i$. We then have from (1) and (2)

$$\operatorname{tr} B R_{qq} B^T = \sum_{i=1}^{m} c\,2^{-2b_i} \sigma_{u_i}^2 \|b_i\|^2 \tag{14}$$

$$\ge c\,m\,2^{-2b} \left[ \prod_{i=1}^{m} \sigma_{u_i}^2 \|b_i\|^2 \right]^{1/m}. \tag{15}$$

The inequality is an AM/GM inequality, where we have defined

$$2^{-2b} = \left( \prod_{i=1}^{m} 2^{-2b_i} \right)^{1/m} \tag{16}$$

as the geometric mean of the factors $2^{-2b_i}$ set by the bit rates $b_i$ (equivalently, $b$ is the arithmetic mean
of the $b_i$). Since $\sigma_{u_i}^2 = (A R_{yy} A^T)_{ii}$, Hadamard’s inequality yields

$$\prod_{i=1}^{m} \sigma_{u_i}^2 \ge \det(A R_{yy} A^T). \tag{17}$$

Using this inequality and the fact that $\det BB^T = (\det B)^2$ in (15) we obtain a new lower bound

$$\operatorname{tr} B R_{qq} B^T \ge c\,m\,2^{-2b} \left[\det(A R_{yy} A^T) \det(BB^T)\right]^{1/m} \left[ \frac{\prod_{i=1}^{m} \|b_i\|^2}{(\det B)^2} \right]^{1/m}. \tag{18}$$

This expression can in turn be lower bounded by using Hadamard’s inequality once more to arrive at

$$\operatorname{tr} B R_{qq} B^T \ge c\,m\,2^{-2b} \left[\det(A R_{yy} A^T) \det(BB^T)\right]^{1/m} \tag{19}$$

$$= c\,m\,2^{-2b} \left[\det(B A R_{yy} A^T B^T)\right]^{1/m} \tag{20}$$

$$= c\,m\,2^{-2b} \left[\det(R_{xy} R_{yy}^{-1} R_{xy}^T)\right]^{1/m}. \tag{21}$$

This final lower bound can be achieved if the inequalities we have used become equalities. For the
Hadamard inequalities this means that

$$A R_{yy} A^T = D_1 \tag{22}$$

$$B^T B = D_2, \tag{23}$$

where $D_1$ and $D_2$ are both m x m diagonal matrices. The two conditions (22) and (23) determine the
coordinate system for u. The AM/GM inequality, on the other hand, becomes an equality if

$$c\,2^{-2b_i} (A R_{yy} A^T)_{ii} \|b_i\|^2 = K, \tag{24}$$

where K is independent of i. This determines the bit assignment for u.

Let us first talk about the coordinate system. From (22) and (23), and since $BA = W$, we find that
$B^T R_{xy} R_{yy}^{-1} R_{xy}^T B = D_1 D_2^2$, which implies that B must diagonalize $R_{xy} R_{yy}^{-1} R_{xy}^T$. We could thus
choose B as the orthogonal matrix U from the eigenvalue decomposition $R_{xy} R_{yy}^{-1} R_{xy}^T = U(ZZ^T)U^T$,
and $A = U^T R_{xy} R_{yy}^{-1} = U^T W$. With this choice, we are first estimating x by passing y through
a Wiener filter W, and then quantizing the estimate $Wy$, as in the noise-free case. The noisy
quantization problem is thus reduced to a standard quantization problem, as long as we observe that the
estimate has covariance matrix $R_{xy} R_{yy}^{-1} R_{xy}^T$ rather than $R_{xx}$. This result connects to a finding in [8]
and [9]. There it was demonstrated that

$$E = E\|x - \hat{x}\|^2 = E\|x - E(x|y)\|^2 + E\|E(x|y) - \hat{x}\|^2 \tag{25}$$

$$= E_{\text{filter}} + E_{\text{quantizer}}, \tag{26}$$

which shows that for the MSE criterion it is optimum to apply the quantizer to a conditional mean
estimator of the message based on the observations. The total MSE E is the sum of the infinite precision
filtering error and the error of quantizing the conditional mean estimate. Our result differs insofar as we
have shown that, using the AWN model for quantization and a transform coding system, it is optimum to
first obtain the linear MMSE estimate $Wy$ through the Wiener filter, and then quantize this estimate. Gaussianity
is not required for our proof. For jointly Gaussian message and observations our finding coincides with
[8], [9], but in general, they are different.

Moreover, our derivation also allows a different interpretation which brings fresh insight. If we start
with the singular value decomposition (SVD) [12, p. 330]

$$R_{xy} R_{yy}^{-1/2} = U Z V^T = U \begin{bmatrix} Z_m & 0_{m \times (n-m)} \end{bmatrix} V^T, \tag{27}$$

one can check that taking $A = Z V^T R_{yy}^{-1/2}$ and $B = U$ also satisfies (22) and (23). Thus
$u = Z V^T R_{yy}^{-1/2} y$, with covariance $R_{uu} = Z_m Z_m^T$, is quantized to give $\hat{u}$, and x is then estimated as $\hat{x} = U\hat{u}$.
The diagonal elements of $Z_m$ are the half canonical correlations $z_i$ between x and y. We can now express
the MMSE in terms of $z_i$ as

$$\min_{A,B} E = \left( \operatorname{tr} R_{xx} - \sum_{i=1}^{m} z_i^2 \right) + c\,m\,2^{-2b} \left( \prod_{i=1}^{m} z_i^2 \right)^{1/m}. \tag{28}$$

The first term in (28) accounts for the infinite-precision filtering error and the second term for the error
due to quantization.

The MSE E will be minimized if bits are assigned according to (24), which says that

$$c\,2^{-2b_i} z_i^2 = K, \tag{29}$$

subject to the bit-budget constraint $B = \sum_{i=1}^{m} b_i$. The solution to the bit assignment problem parallels
the one for standard transform coding if we observe that the variance of $u_i$ is $z_i^2$. Components $u_i$ with
greater squared half canonical correlation $z_i^2$ will be assigned more bits and according to [3, Ch. 8.3] we have

$$b_i = \frac{B}{m} + \frac{1}{2} \log_2 \frac{z_i^2}{\left( \prod_{j=1}^{m} z_j^2 \right)^{1/m}}. \tag{30}$$
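The bit assignment rule (30) can be sketched in a few lines. The example below is only illustrative: the squared half canonical correlations and the bit budget are hypothetical, and the rule is applied as stated, without rounding to integers or clipping negative bit counts, which a practical coder would have to handle.

```python
import math


def allocate_bits(variances, total_bits):
    """Bit assignment of the form (30): b_i = B/m + 0.5 * log2(var_i / geometric mean)."""
    m = len(variances)
    log_gm = sum(math.log2(v) for v in variances) / m  # log2 of the geometric mean
    return [total_bits / m + 0.5 * (math.log2(v) - log_gm) for v in variances]


z_sq = [0.81, 0.36, 0.09]  # hypothetical squared half canonical correlations
bits = allocate_bits(z_sq, total_bits=12)

# Condition (29): z_i^2 * 2^(-2 b_i) should be the same for every component
# (the constant factor c is omitted because it is common to all components).
levels = [z * 2.0 ** (-2.0 * b) for z, b in zip(z_sq, bits)]
```

The assigned bits sum to the budget B, and components with larger $z_i^2$ receive more bits, as the text describes.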


Note that (22) and (23) allow the transformations A and B to be scaled by a non-singular diagonal
matrix. If A is replaced by $D^{-1}A$ and B by $BD$, then the quantizer input u will still be uncorrelated
and according to (24) the optimum bit assignment is left unchanged. For instance, we could choose
$A = V_m^T R_{yy}^{-1/2}$ (with $V_m$ the first m columns of V) and $B = U Z_m$. With this choice, the transformation
$u = Ay$ takes y into a half canonical coordinate system, where the white, unit-variance, half canonical
coordinates u are quantized. The transformation B applies a diagonal Wiener filter $Z_m$ in canonical
coordinates to the quantizer output $\hat{u}$, and transforms the filter output back into the original coordinate
system. Note that in this implementation quantization precedes estimation. In Section V we further
explore how canonical correlations illuminate quantization. We also look at different realizations of the
min-trace quantizer.


B.  No  High-Resolution  Assumption 

If we do not use the high-resolution assumption, meaning that the AWN model cannot be employed,
either, we should not expect a half canonical coordinate system to be optimum in all generality. However,
for the min-trace problem, we can prove the following: Suppose that x and y are jointly Gaussian and
$A = U^T W$, $B = U$, where U is orthogonal. Furthermore, we model the variance of the quantization
noise as in (4). Then given any particular bit assignment vector $b = (b_1, \dots, b_m)^T$, the MSE E is
minimized if Ay produces an uncorrelated u, i.e., U diagonalizes $R_{xy} R_{yy}^{-1} R_{xy}^T$. While the problem is very
restrictive, the proof is nevertheless interesting because it makes obvious the role that majorization plays
in quantization, as we now demonstrate.

First, notice that since message and observation are jointly Gaussian and $A = U^T W$, we can use the
result from [8], [9], detailed in (25), and only concern ourselves with minimizing

$$E_{\text{quantizer}} = E\|Wy - \hat{x}\|^2 = E\|U(u - \hat{u})\|^2 = E\|u - \hat{u}\|^2 = \sum_{i=1}^{m} \sigma_{u_i}^2 f(b_i). \tag{31}$$

This means we must simply demonstrate that a KLT is optimum for quantizing $Wy$, which has
covariance $R_{xy} R_{yy}^{-1} R_{xy}^T$.

The following generalizes [2, Proof 2, App. I]. Let us first re-order the components of u such that
$\sigma_{u_i}^2 \ge \sigma_{u_{i+1}}^2$, $i = 1, \dots, m-1$. Then note that minimization of (31) requires that $b_i \ge b_{i+1}$,
$i = 1, \dots, m-1$, because $f(b_i)$ is a non-increasing function. This conforms with intuition: We should clearly
assign components with greater variance more bits. With these assumptions, $\sigma = (\sigma_{u_1}^2, \dots, \sigma_{u_m}^2)^T \in \mathcal{D}_m$
and $b \in \mathcal{D}_m$ are both members of the set of ordered m-tuples $\mathcal{D}_m$.

The quantization error $E_{\text{quantizer}}$ is a function of $\sigma$. It follows from the majorization result

$$\sigma = \operatorname{diag}(R_{uu}) \prec \operatorname{ev}(R_{uu}) = \operatorname{ev}(R_{xy} R_{yy}^{-1} R_{xy}^T) \tag{32}$$

that in order to minimize any Schur-concave function of $\sigma \in \mathcal{D}_m$, U must diagonalize $R_{xy} R_{yy}^{-1} R_{xy}^T$.
Thus, if we can show that our performance measure is Schur-concave on $\mathcal{D}_m$, we have proved optimality
of the KLT. First notice that since $f$ is non-increasing and $b_i \ge b_{i+1}$, we have $f(b_i) \le f(b_{i+1})$. It then
follows immediately from Prop. 1 that $E_{\text{quantizer}}$ is Schur-concave.

Our proof is more general than the proof in [2] because it shows that the KLT is optimum for noise-free
transform coding for all Schur-concave performance measures, not just MSE. Of course, this statement
holds only for the assumptions stated at the beginning of this section. In particular, U must be orthogonal.
The reason this proof cannot be extended to non-orthogonal transformations is that $\|U(u - \hat{u})\|^2$ in
general depends on the cross-correlations $E q_i q_j$, $i \ne j$, unless U is orthogonal. These cross-correlations
are given by complicated expressions, which do not easily admit a solution to the minimization problem
$\min E_{\text{quantizer}}$.

C.  The  Role  of  Majorization 

The proof in the previous section, which does not invoke the high-resolution assumption, makes the
role of majorization obvious. If we had complete control over how to distribute $\operatorname{tr} R_{uu}$ over the diagonal
elements $\sigma$, then clearly $E_{\text{quantizer}}$ as given by (31) would be minimized by choosing $\sigma_{u_1}^2 = \operatorname{tr} R_{uu}$,
$\sigma_{u_i}^2 = 0$, $i = 2, \dots, m$, and spending the entire bit-budget B on quantizing component $u_1$. The question is
how close we can come to this rank-1 choice while observing the constraint that $u = U^T W y$. Majorization
gives the answer. It is apparent that the bigger the spread in the vector $\sigma$, the smaller MSE will be. Since
$\sigma \prec \operatorname{ev}(R_{xy} R_{yy}^{-1} R_{xy}^T)$, the maximum spread in $\sigma$, and therefore minimum MSE, is achieved when U is
a KLT.
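The claim that a larger spread in $\sigma$ lowers the quantization error (31) can be illustrated with a small numerical sketch. Everything below is an assumption made for the example: the 2 x 2 covariance entries, the bit assignment, and the choice $f(b) = 2^{-2b}$ as one non-increasing distortion factor.

```python
import math


def f(bits):
    """Hypothetical non-increasing distortion factor; 2**(-2b) is one common choice."""
    return 2.0 ** (-2.0 * bits)


def quantizer_error(variances, bits):
    """Sum of sigma_i^2 * f(b_i) as in (31), larger variances paired with more bits."""
    v = sorted(variances, reverse=True)
    b = sorted(bits, reverse=True)
    return sum(vi * f(bi) for vi, bi in zip(v, b))


# Hypothetical 2x2 covariance [[a, c], [c, d]] of the vector to be quantized.
a, c, d = 2.0, 1.0, 1.0
mean, radius = 0.5 * (a + d), math.hypot(0.5 * (a - d), c)
eigenvalues = [mean + radius, mean - radius]  # component variances after a KLT
diagonal = [a, d]                             # component variances without rotation

bits = [3, 1]
err_klt = quantizer_error(eigenvalues, bits)
err_identity = quantizer_error(diagonal, bits)
```

With an unequal bit assignment the eigenvalue vector, which has the larger spread, gives the smaller error; with equal bits the two errors coincide because both vectors have the same sum.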

Let us now demonstrate how other proofs of the KLT’s MSE optimality make use of majorization.
Goyal et al. in [2, Proof 1, App. I] show, under the same restrictive requirements as the proof in
Section III-B, that given any orthogonal transformation T, there exists a KLT U that yields MSE at most as
high as T. They proceed by constructing a series of Jacobi rotations $\{J_i\}$, which iteratively diagonalize
$T E(xx^T) T^T$. Each $J_i$ makes one off-diagonal element zero, and acts only on two diagonal elements,
increasing one by $\delta$, and decreasing the other one by $\delta$. Since this increases the spread of the diagonal
elements, quantizing $J_{i+1} J_i \cdots J_1 T x$ is better than quantizing $J_i \cdots J_1 T x$. Because $U = J_k \cdots J_2 J_1 T$,
this iteratively shows optimality of the KLT. In essence, this construction is a complicated proof of
$\operatorname{diag}(R_{uu}) \prec \operatorname{ev}(R_{uu})$, as we now show. Each rotation $J_i$ changes the diagonal elements of
$J_{i-1} \cdots J_1 T E(xx^T) T^T J_1^T \cdots J_{i-1}^T$ in such a way that the diagonal before the rotation is a so-called
T-transformation of the diagonal after it. A T-transformation has the form
$T(z) = (z_1, \dots, z_{k-1}, \alpha z_k + (1 - \alpha) z_l, z_{k+1}, \dots, z_{l-1}, (1 - \alpha) z_k + \alpha z_l, z_{l+1}, \dots, z_n)^T$, where $\alpha \in [0, 1]$.

Since $\operatorname{diag}(R_{uu})$ can be derived from $\operatorname{ev}(R_{uu})$ by successive applications of T-transformations, we
have $\operatorname{diag}(R_{uu}) \prec \operatorname{ev}(R_{uu})$ [18, Ch. 4].
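A T-transformation is simple to state in code. The sketch below uses 0-based indices and a hypothetical input vector; it shows that the transformation keeps the sum of the components fixed while pulling the two affected components toward each other, which is why its output is majorized by its input.

```python
def t_transform(z, k, l, alpha):
    """Apply a T-transformation to components k and l (0-based) of z, alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    out = list(z)
    out[k] = alpha * z[k] + (1.0 - alpha) * z[l]
    out[l] = (1.0 - alpha) * z[k] + alpha * z[l]
    return out


z = [3.0, 1.0]                        # hypothetical vector, e.g. two eigenvalues
averaged = t_transform(z, 0, 1, 0.75)  # components move toward each other
```

For this input the result is [2.5, 1.5]: the sum is unchanged and the spread is smaller. Choosing alpha = 1 leaves the vector as it is, and alpha = 0 swaps the two components.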

In a more general setting, we have proved optimality of half canonical coordinates for quantization in
Section III-A. The key step is Hadamard’s inequality (17). This inequality is a direct consequence of the
majorization result $\sigma \prec \operatorname{ev}(R_{uu})$, as we have already demonstrated in Section II-B. Achieving equality
in this inequality requires the largest possible spread among the diagonal elements of $R_{uu} = A R_{yy} A^T$.
This leads to a diagonal matrix $R_{uu}$ with squared half canonical correlations on its diagonal.

IV. Full Canonical Coordinates Solve the min-det or Maximum Information Rate Problem

In this section, we show that the right coordinate system for u to solve the min-det problem is the
system of full canonical coordinates [13] - [17]. The proof will be based on the AWN model. It does not
seem possible to extend it to the non-high resolution case, not even under the restrictive assumptions of
Section III-B. The reason is that $\det R_{ee}$ depends on the cross-correlations $E q_i q_j$, which are generally
non-zero in the absence of high-resolution.

We start the minimization of $V = \det R_{ee}$ by applying Minkowski's determinant inequality to (13):

$$V \ge \left( \left(\det(Q + B R_{qq} B^T)\right)^{1/m} + \left(\det\left((W - BA) R_{yy} (W - BA)^T\right)\right)^{1/m} \right)^m. \tag{33}$$

Since both the term in the left det(·) and the right det(·) expression are positive semi-definite, we can
minimize V by making the second term zero, i.e., choosing $BA = W$. We will assume this optimum
choice in what follows. Using Minkowski’s determinant inequality once more yields

$$V \ge \left( (\det Q)^{1/m} + \left(\det(B R_{qq} B^T)\right)^{1/m} \right)^m. \tag{34}$$

Since Q does not depend on the choice of A or B, $\det(B R_{qq} B^T)$ must be minimized in order to minimize
the bound on V. Similar to the procedure in the previous section we have

$$\det(B R_{qq} B^T) = (c\,2^{-2b})^m \prod_{i=1}^{m} (A R_{yy} A^T)_{ii} \det(B^T B) \tag{35}$$

$$\ge (c\,2^{-2b})^m \det(A R_{yy} A^T) \det(B^T B) \tag{36}$$

$$= (c\,2^{-2b})^m \det(R_{xy} R_{yy}^{-1} R_{xy}^T). \tag{37}$$

Inequality (36) is again Hadamard’s inequality, which becomes an equality if $A R_{yy} A^T$ is diagonal.
Minkowski’s inequality (34), on the other hand, becomes an equality if

$$Q = K \cdot B R_{qq} B^T \tag{38}$$

for some $K > 0$. With the knowledge that $A R_{yy} A^T$ must be diagonal for equality, this means

$$Q = cK \cdot B\left[(A R_{yy} A^T) \operatorname{diag}(2^{-2b_1}, \dots, 2^{-2b_m})\right] B^T. \tag{39}$$


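The expression in square brackets is a diagonal matrix. Thus, in order to evaluate what (39) implies for
B, we would like to factor $Q = T D_Q T^T$, where $D_Q$ is diagonal. To this end, we need the SVD of the
coherence matrix [17]

$$R_{xx}^{-1/2} R_{xy} R_{yy}^{-1/2} = F \mathbf{K} G^T = F \begin{bmatrix} \mathbf{K}_m & 0_{m \times (n-m)} \end{bmatrix} G^T. \tag{40}$$

Then we can re-write (39) as

$$Q = R_{xx}^{1/2} F (I - \mathbf{K}\mathbf{K}^T) F^T R_{xx}^{1/2} = cK \cdot B\left[(A R_{yy} A^T) \operatorname{diag}(2^{-2b_1}, \dots, 2^{-2b_m})\right] B^T. \tag{41}$$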

It is apparent that the following are possible choices for A and B such that $BA = W$, $A R_{yy} A^T$ is
diagonal, and (41) is satisfied:

$$A = D^{-1} \mathbf{K} G^T R_{yy}^{-1/2} \tag{42}$$

$$B = R_{xx}^{1/2} F D \tag{43}$$

Here, D is any non-singular diagonal matrix. For $D = \mathbf{K}_m$, the transformation $u = Ay = G_m^T R_{yy}^{-1/2} y$
(with $G_m$ the first m columns of G) takes y into the full canonical coordinate system, where the white,
unit-variance, full canonical coordinates u are quantized. The transformation $B = R_{xx}^{1/2} F \mathbf{K}_m$ applies a
diagonal Wiener filter $\mathbf{K}_m$ in full canonical coordinates to the quantizer output, and the filter output is
transformed back into the original coordinate system with $R_{xx}^{1/2} F$. The diagonal Wiener filter $\mathbf{K}_m$
contains the full canonical correlations $k_i$ between x and y on its diagonal.

It is instructive to express the minimum achievable value of V in terms of $k_i$:

$$\min_{A,B} V = \left[ (\det Q)^{1/m} + c\,2^{-2b} \left(\det R_{xy} R_{yy}^{-1} R_{xy}^T\right)^{1/m} \right]^m \tag{44}$$

$$= \det(R_{xx}) \left[ \left( \prod_{i=1}^{m} (1 - k_i^2) \right)^{1/m} + c\,2^{-2b} \left( \prod_{i=1}^{m} k_i^2 \right)^{1/m} \right]^m. \tag{45}$$

The first term is the infinite precision filtering error, and the second term is due to quantization [16].
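Equation (41) also determines the right bit assignment strategy. We immediately obtain

$$1 - k_i^2 = cK \cdot 2^{-2b_i} k_i^2, \quad i = 1, \dots, m. \tag{46}$$

If we define $\gamma_i^2 = k_i^2 / (1 - k_i^2)$, we must satisfy

$$c\,2^{-2b_i} \gamma_i^2 = K^{-1}, \tag{47}$$

where the right-hand side is again independent of i. This means we have the same solution as for (29)
with squared half canonical correlations $z_i^2$ replaced by $\gamma_i^2$:

$$b_i = \frac{B}{m} + \frac{1}{2} \log_2 \frac{\gamma_i^2}{\left( \prod_{j=1}^{m} \gamma_j^2 \right)^{1/m}}. \tag{48}$$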
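To make the step from full canonical correlations to bits concrete, the following sketch converts hypothetical correlations $k_i$ into the weights $\gamma_i^2$ and applies (48). The correlation values and the bit budget are illustrative assumptions, and no rounding or clipping of the bit counts is attempted.

```python
import math


def gamma_sq(k):
    """gamma^2 = k^2 / (1 - k^2) for a canonical correlation 0 < k < 1."""
    return k * k / (1.0 - k * k)


def allocate_bits(weights, total_bits):
    """Bit assignment of the form (48): b_i = B/m + 0.5 * log2(w_i / geometric mean)."""
    m = len(weights)
    log_gm = sum(math.log2(w) for w in weights) / m
    return [total_bits / m + 0.5 * (math.log2(w) - log_gm) for w in weights]


k = [0.9, 0.6, 0.3]  # hypothetical full canonical correlations
bits = allocate_bits([gamma_sq(ki) for ki in k], total_bits=9)

# Condition (46): (1 - k_i^2) / (k_i^2 * 2^(-2 b_i)) should not depend on i.
ratios = [(1.0 - ki * ki) / (ki * ki * 2.0 ** (-2.0 * bi)) for ki, bi in zip(k, bits)]
```

Compared with the min-trace rule, the weights $\gamma_i^2$ grow without bound as $k_i$ approaches one, so strongly correlated coordinates attract a larger share of the budget.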


Notice that, just like the min-trace problem, the min-det problem can also be solved by first computing
the linear MMSE estimate of x as $Wy$ and then quantizing this estimate. To see this, write
$A = \mathbf{K} G^T R_{yy}^{-1/2} = F^T R_{xx}^{-1/2} R_{xy} R_{yy}^{-1} = F^T R_{xx}^{-1/2} W$, and $B = R_{xx}^{1/2} F$. Therefore, this problem again may be separated
into an estimation and a quantization problem, as depicted in Fig. 2 (d). Observe that the quantizer
is a maximum information rate rather than an MMSE quantizer. It contains the coder $F^T R_{xx}^{-1/2}$ and
the decoder $R_{xx}^{1/2} F$. Thus, even though coder and decoder are inverses of each other, they are
non-orthogonal, unlike the min-trace case. This establishes that for performance measures other than MSE
orthogonal transformations can be outperformed by non-orthogonal transformations.

For jointly Gaussian message and measurements, the separation into Wiener filter and quantizer can
also be deduced from Ephraim and Gray [10], using a result from Hua et al. [11]. Ephraim and Gray
have generalized the result of [8], [9], detailed in (25), to more general performance measures, including
weighted MSE. Hua et al. have shown that the min-det problem is equivalent to the weighted MMSE
problem $\min_{A,B} \operatorname{tr}(R_{xx}^{-1} R_{ee})$. Note that the discussion from Section III-A applies here, as well: Our
proof shows that, using the AWN model and a transform coding system, it is optimum (under the maximum
information rate criterion) to first obtain the linear MMSE estimate $Wy$ through the Wiener
filter, and apply a maximum information rate quantizer to this estimate. It follows from [10] and [11] that, under the
maximum information rate criterion, it is generally optimum to obtain a conditional mean estimate of the
message based on the observations, and then to quantize this estimate. For jointly Gaussian message and
observations, the findings coincide.

In the proof that full canonical coordinates are optimum for maximum information rate quantization,
the role of majorization is concealed through the use of Hadamard’s and Minkowski’s inequalities. For
the min-trace problem $\min \operatorname{tr}(R_{ee})$ of Section III-A, majorization requires maximum possible spread
among the diagonal elements of $R_{xy} R_{yy}^{-1} R_{xy}^T$. This leads to diagonalization of this matrix, with squared
half canonical correlations on its diagonal. For the min-det problem of this section, we are minimizing
$\operatorname{tr}(R_{xx}^{-1/2} R_{ee} R_{xx}^{-1/2})$. Thus, majorization requires maximum possible spread among the diagonal
elements of the squared coherence matrix $R_{xx}^{-1/2} R_{xy} R_{yy}^{-1} R_{xy}^T R_{xx}^{-1/2}$. Again, achieving this maximum
spread results in diagonalization of this matrix, with squared full canonical correlations on its diagonal.

V.  Different  Realizations 

Rank reduction can be viewed as a special case of quantization since it amounts to assigning infinitely
many bits to, say, r components and zero bits to the remaining $m - r$ components. However, optimality
of canonical coordinates for rank reduction cannot be deduced from the results in the preceding sections,
since assigning zero bits to components violates the high-resolution assumption. The only exception is
the proof in Section III-B, which does not use the high-resolution assumption, but instead requires the
coder to be of the form $A = U^T W$, U orthogonal, and jointly Gaussian message and measurement. If
we define the rate-distortion function as $f(0) = 1$, $f(\infty) = 0$, we have $f(b_i) = 0$ for $i = 1, \dots, r$, the
components we keep, and $f(b_i) = 1$ for $i = r + 1, \dots, m$, the components we purge. Then the proof in
Section III-B directly shows optimality of half canonical coordinates for rank reduction under the MMSE
criterion, albeit under the restrictive assumptions mentioned above.

However,  it  has  already  been  proved  that  some  system  of  canonical  coordinates  is  optimum  for  rank 
reduction in all generality, without making any restrictive assumptions at all. Again, half canonical coordinates minimize the trace [12, p. 330] and full canonical coordinates minimize the determinant of the error covariance matrix [11]. Together with our results, the important implications are these: Suppose we
have  a  reduced  rank  Wiener  filter,  designed  to  either  control  MMSE  or  information  rate.  Then  suppose 
this  filter  is  to  be  quantized.  The  resulting  reduced  rank  quantized  structure  retains  the  original  coordi¬ 
nate  system  and  replaces  infinite  precision  internal  coordinates  with  quantized  coordinates.  That  is,  the 
coordinate  system  does  not  change. 

Fig. 2 displays several different implementations of reduced rank quantizers in full canonical coordinates. The quantizer $Q$ can assign zero bits to a number of components, which means that only an $r$-dimensional statistic is stored or transmitted in these implementations. Before reconstruction, this $r$-dimensional statistic is augmented with zeros to build again an $m$-dimensional vector, as indicated in the figure by the $r \to m$ building block.

Fig. 2. Different implementations of reduced rank quantizers in full canonical coordinates (panels (a) through (d))

Each line of Fig. 2 is insightful. Lines (a) and (b) whiten with $R_{yy}^{-1/2}$, resolve onto the basis for $\langle G_m \rangle$, quantize and filter (or vice versa), reconstruct in the basis $\langle F \rangle$, and color with $R_{xx}^{1/2}$. There are implementations where explicit estimation precedes quantization and vice versa. For example, line (c) shows the quantized estimator to consist of whitening, analysis onto the basis for $\langle G_m \rangle$, quantizing, synthesis in the basis for $\langle G_m \rangle$, coloring, and filtering. In a storage or transmission application, only $\hat{u} = Q[G_m^T R_{yy}^{-1/2} y]$ would be stored or transmitted, and $W R_{yy}^{1/2} G_m$ would be computed at the receiver. Quantization in half canonical coordinates can be implemented very similarly, simply replacing $F K G^T$ with $U \Sigma V^T$ and $R_{xx}$ with the identity in Fig. 2.

VI.  Conclusions 

We  have  shown  that  transform  coding  of  noisy  sources  is  a  story  of  majorization,  either  directly,  or 
indirectly  through  the  use  of  Hadamard,  AM/GM,  and  Minkowski  inequalities.  We  have  proved  that  the 
right  coordinate  systems  for  quantization  are  the  systems  of  half  and  full  canonical  coordinates.  Half 
canonical  coordinates  minimize  the  trace  and  full  canonical  coordinates  minimize  the  determinant  of  the 
error covariance matrix. It has been proved earlier [12, p. 330], [11] that canonical coordinates are optimum for rank reduction as well. Together with our results, this means that we can first choose a coordinate system and then decide how many bits to spend on how many components.

When looking at different transform coding schemes, it is essential to be very clear about the underlying assumptions. Our proofs that canonical coordinates are indeed optimum for transform coding require the use of the AWN model for quantization. Without at least the high resolution quantization assumption, very restrictive assumptions are needed to prove optimality of canonical coordinates. However, assumptions that are too restrictive usually limit performance: for instance, orthogonal transforms are in general inferior to non-orthogonal transforms for performance criteria other than MSE, in particular maximum information rate.

Finally, a remark regarding the extension to the complex case: It is often stated that quantization of complex vectors is essentially the same as for real vectors, as long as the definition of the inner product is changed from $\langle x, y \rangle = x^T y$ to $\langle x, y \rangle = x^H y$, where $x^H$ is the complex conjugate transpose of $x$. Thus, in this paper it should suffice to redefine covariance matrices as $R_{xy} = E\,xy^H$ etc. This, however, is only true if message and observation are jointly proper, which means that the complementary covariance matrices $E\,xx^T$, $E\,xy^T$, and $E\,yy^T$ are all zero. If they are not, linear algebra will not give optimum performance and widely linear transformations must be used instead [20]. Canonical coordinates for improper complex random vectors are described in detail in [21].
ABSTRACT

Historically,  transform  coding  of  noisy  sources  has  been 
performed  by  first  estimating  the  message  and  then  quan¬ 
tizing  this  estimate.  We  show  that  it  is  also  optimum  to 
first  transform  the  noisy  observations  into  canonical  coor¬ 
dinates,  quantize,  apply  a  Wiener  filter  in  this  coordinate 
system,  and  then  transform  the  result  back  to  the  original 
coordinates.  Canonical  coordinates  are  uncorrelated,  and 
quantization  and  Wiener  filtering  are  applied  to  each  com¬ 
ponent  independently.  Optimality  of  this  approach  can  be 
proved  assuming  additive  white  quantization  noise.  Half 
canonical  coordinates  minimize  the  mean-squared  error  by 
minimizing  the  trace  of  the  error  covariance  matrix  and  full 
canonical  coordinates  maximize  information  rate  by  mini¬ 
mizing  the  determinant  of  the  error  covariance  matrix. 

1.  INTRODUCTION 

In this paper we are interested in transform coding of noisy sources. Thus, we are looking for an answer to the question: Given a finite bit-budget of $B$ bits, how can we most efficiently represent the information that a noisy observation $y \in \mathbb{R}^n$ contains about a random message $x \in \mathbb{R}^m$? A transform coder is depicted in Fig. 1. First, the observation $y$ is passed through a linear transformation $A \in \mathbb{R}^{m \times n}$, which we call the coder. The output of the coder is $u = Ay$, which is subsequently processed by a scalar quantizer. That is, each component of $u$ is independently quantized. The quantizer output $\hat{u}$ is supposed to be an efficient representation of the message $x$, not the measurement $y$. To produce an estimate $\hat{x} = B\hat{u}$ of the message $x$, $\hat{u}$ is linearly transformed by the decoder $B \in \mathbb{R}^{m \times m}$. Without loss of generality, we suppose that $m \le n$. Furthermore, we will assume that $x$ and $y$ have zero mean and that we have the necessary second-order information available, namely, the covariance matrices of $x$ and $y$, denoted by $R_{xx} = E\,xx^T$ and $R_{yy} = E\,yy^T$, respectively, and the cross-covariance matrix $R_{xy} = E\,xy^T$.

This work was supported by the DARPA ISP program under contract AFRL F33615-02-C-1198 and the 2001 NSF ITR Initiative under contract CCR-0112573.


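[Block diagram: observation y (dimension n), coder A, scalar quantizer Q, decoder B, with m-dimensional signals after the coder]

Fig. 1. Transform Coder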


The problem can now be reformulated as follows: First, how do we choose $A$ and $B$, i.e., in what coordinate system should we quantize? Second, how do we distribute the total number of bits $B$ over the components of $u$ so that $\hat{x}$ is a good estimate of $x$? To make precise what we mean by a “good” estimate we will employ two different performance measures: $E = \operatorname{tr} R_{ee} = \operatorname{tr} E\,ee^T = E\|\hat{x} - x\|^2$, which is the mean squared error (MSE), and $V = \det R_{ee}$, which measures the volume of the error covariance ellipsoid and thus information rate in the Gaussian case. For simplicity, let us refer to the problems where we try to minimize $E$ and $V$ as the min-trace and min-det problems, respectively.

In  this  paper,  we  show  that  a  possible  solution  to  the 
min-trace  and  the  min-det  problem  is  to  first  transform  the 
noisy  observations  into  a  half  canonical  [3,  p.  330]  or  full 
canonical  coordinate  system  [4],  respectively,  quantize,  Wie¬ 
ner  filter  in  this  coordinate  system,  and  then  transform  the 
result  back  to  the  original  coordinates.  Canonical  coordi¬ 
nates  are  uncorrelated,  which  means  quantization  and  Wie¬ 
ner  filtering  are  applied  to  each  component  independently. 
This  extends  previous  results  in  that  it  provides  a  concrete 
coordinate  system  for  quantization.  Moreover,  our  results 
show  that  transform  coders  have  many  different  implemen¬ 
tations:  for  example,  there  are  implementations  where  quan¬ 
tization  precedes  estimation,  and  vice  versa. 

The proofs of optimality that we provide are based on the additive white noise (AWN) model for quantization. If we let $b_i$ denote the number of bits for quantizing component $u_i$, and $\sigma_i^2 = E\,u_i^2$ the variance of $u_i$, then the main assumptions of the AWN model may be summarized as follows:

$$E\,qq^T = \operatorname{diag}(d_1^2, \ldots, d_m^2)$$

$$d_i^2 = E\,q_i^2 = c\,\sigma_i^2\,2^{-2b_i}, \quad i = 1, \ldots, m$$

$$E\,uq^T = 0$$

The constant $c$ depends on the distribution of $u_i$. If $u_i$ is zero-mean Gaussian, then $c = \sqrt{3}\pi/2$.
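The AWN noise-variance rule can be illustrated with a minimal sketch. The function name `awn_noise_variance` and the example values are assumptions made for illustration only; the formula and the Gaussian constant are the ones stated above.

```python
import math

# Constant c for a zero-mean Gaussian component, as stated in the text.
C_GAUSS = math.sqrt(3) * math.pi / 2


def awn_noise_variance(sigma2, bits, c=C_GAUSS):
    """Quantization-noise variance c * sigma_i^2 * 2^(-2 b_i) under the AWN model."""
    return c * sigma2 * 2.0 ** (-2 * bits)


# Hypothetical unit-variance component quantized with 4 and 5 bits.
noise_4 = awn_noise_variance(1.0, 4)
noise_5 = awn_noise_variance(1.0, 5)
print(noise_4, noise_5, noise_5 / noise_4)
```

Under this model each additional bit divides the noise variance by four, which is the dependence that the bit-assignment arguments in the following sections rely on.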

2.  MIN-TRACE  PROBLEM 

In this section, we show that for the min-trace problem the right coordinate system for quantization is the system of half canonical coordinates [3, p. 330]. Referring to the notation introduced in Fig. 1, we can express the error vector $e = \hat{x} - x$ as

$$e = (BAy - x) + Bq.$$

If we invoke the AWN model, then $E\,xq^T = 0$ and $E\,yq^T = 0$, and the error covariance matrix $R_{ee} = E\,ee^T$ becomes

$$R_{ee} = E\left[(BAy - x)(BAy - x)^T\right] + B R_{qq} B^T,$$

which can be expressed as the sum of three positive semidefinite terms:

$$R_{ee} = Q + (W - BA) R_{yy} (W - BA)^T + B R_{qq} B^T. \quad (1)$$

In this equation, $W = R_{xy} R_{yy}^{-1}$ is the Wiener filter and $Q = R_{xx} - R_{xy} R_{yy}^{-1} R_{xy}^T$ is its filtering error covariance matrix. It is clear that we can make the middle term in (1) zero if we choose $BA = W$, and we will assume this optimum choice in what follows. Thus the infinite precision quantizer is a Wiener filter with error covariance $Q$. Since $Q$ does not depend on how we select $A$ and $B$, minimizing $\operatorname{tr} R_{ee}$ amounts to minimizing $\operatorname{tr} B R_{qq} B^T$. This can be achieved with a variation on the proof for the noiseless case in [6, Appendix]. Denote the $i$-th column of $B$ by $b_i$. We then have

$$\operatorname{tr} B R_{qq} B^T = \sum_{i=1}^m c\,2^{-2b_i} \sigma_i^2 \|b_i\|^2 \ge c\,m\,2^{-2b} \left[\prod_{i=1}^m \sigma_i^2 \|b_i\|^2\right]^{1/m}. \quad (2)$$

The inequality is an AM/GM inequality, with average bit rate $b = B/m$. Since $\sigma_i^2 = (A R_{yy} A^T)_{ii}$, Hadamard’s inequality yields $\prod_{i=1}^m \sigma_i^2 \ge \det(A R_{yy} A^T)$. Using this inequality in (2) we obtain a new lower bound on $\operatorname{tr} B R_{qq} B^T$ as

$$c\,m\,2^{-2b} \left[\det(A R_{yy} A^T) \prod_{i=1}^m \|b_i\|^2\right]^{1/m}.$$

This expression can in turn be lower bounded by using another Hadamard inequality to arrive at

$$\operatorname{tr} B R_{qq} B^T \ge c\,m\,2^{-2b} \left[\det(A R_{yy} A^T) \det(B^T B)\right]^{1/m} = c\,m\,2^{-2b} \left[\det(R_{xy} R_{yy}^{-1} R_{xy}^T)\right]^{1/m}.$$

This final lower bound can be achieved if the inequalities we have used become equalities. For the Hadamard inequalities this means that

$$A R_{yy} A^T = D_1 \quad (3)$$

$$B^T B = D_2, \quad (4)$$

where $D_1$ and $D_2$ are both $m \times m$ diagonal matrices. The two conditions (3) and (4) determine the coordinate system for $u$. The AM/GM inequality, on the other hand, becomes an equality if

$$2^{-2b_i} (A R_{yy} A^T)_{ii} \|b_i\|^2 = M, \quad (5)$$

where the constant $M > 0$ is independent of $i$. This determines the bit assignment for $u$.

Let us first talk about the coordinate system. From (3) and (4), and since $BA = W$, we find that $B^T R_{xy} R_{yy}^{-1} R_{xy}^T B = D_2 D_1 D_2^T$, which implies that $B$ must diagonalize $R_{xy} R_{yy}^{-1} R_{xy}^T$. We could thus choose $B$ as the orthogonal matrix $U$ from the eigenvalue decomposition $R_{xy} R_{yy}^{-1} R_{xy}^T = U(\Sigma\Sigma^T)U^T$, and $A = U^T R_{xy} R_{yy}^{-1} = U^T W$. With this choice, we are first estimating $x$ by passing $y$ through a Wiener filter $W$, and then quantizing the estimate $\hat{x} = Wy$, as in the noise-free case. The noisy quantization problem is thus reduced to a standard quantization problem, as long as we observe that the estimate $\hat{x}$ has covariance matrix $R_{xy} R_{yy}^{-1} R_{xy}^T$ rather than $R_{xx}$. This result connects to a finding in [2], where it was shown that for the MSE criterion it is optimum to apply the quantizer to a conditional mean estimator of the message based on the observations. The total MSE $E$ is then the sum of the infinite precision filtering error and the error of quantizing the conditional mean estimate. Our result differs insofar as we have shown that, using the AWN model for quantization and a transform coding system, it is optimum to first obtain a linear MMSE estimate $\hat{x} = Wy$ through the Wiener filter, and then quantize $\hat{x}$. Gaussianity is not required for our proof.

Moreover, our derivation also allows a different interpretation which brings fresh insight. If we start with the singular value decomposition (SVD) [3, p. 330]

$$R_{xy} R_{yy}^{-T/2} = U \Sigma V^T = U \begin{bmatrix} \Sigma_m & 0_{m \times (n-m)} \end{bmatrix} \begin{bmatrix} V_m^T \\ V_0^T \end{bmatrix},$$

one can check that taking $A = \Sigma V^T R_{yy}^{-1/2}$ and $B = U$ also satisfies (3) and (4). Thus $u = \Sigma V^T R_{yy}^{-1/2} y$, with covariance $R_{uu} = \Sigma_m \Sigma_m^T$, is quantized for $\hat{u}$, and $x$ is then estimated as $\hat{x} = U\hat{u}$. The diagonal elements of $\Sigma_m$ are the half canonical correlations $\sigma_i$ between $x$ and $y$.
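The half canonical coordinate system described above can be checked numerically. The following is a minimal sketch, assuming a symmetric matrix square root for $R_{yy}$ and a hypothetical linear observation model $y = Hx + \text{noise}$ that is used only to generate valid covariance matrices; the function names are illustrative and not from the paper.

```python
import numpy as np


def inv_sqrt_sym(R):
    """Symmetric inverse square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(R)
    return Q @ np.diag(w ** -0.5) @ Q.T


def half_canonical(Rxy, Ryy):
    """Coder A = Sigma V^T Ryy^{-1/2}, decoder B = U, half canonical correlations s."""
    Ryy_ih = inv_sqrt_sym(Ryy)
    # Economy-size SVD: U is m x m, s has length m, Vt is m x n (requires m <= n).
    U, s, Vt = np.linalg.svd(Rxy @ Ryy_ih, full_matrices=False)
    A = np.diag(s) @ Vt @ Ryy_ih
    return A, U, s


rng = np.random.default_rng(0)
m, n = 2, 3
H = rng.standard_normal((n, m))        # hypothetical observation matrix
Rxx = np.eye(m)
Rxy = Rxx @ H.T                        # E x y^T for y = H x + noise
Ryy = H @ Rxx @ H.T + 0.5 * np.eye(n)  # noise covariance 0.5 * I (assumed)

A, B, s = half_canonical(Rxy, Ryy)
W = Rxy @ np.linalg.inv(Ryy)           # Wiener filter

print(np.allclose(B @ A, W))                       # BA = W
print(np.allclose(A @ Ryy @ A.T, np.diag(s**2)))   # condition (3)
print(np.allclose(B.T @ B, np.eye(m)))             # condition (4)
```

The three checks correspond to $BA = W$ and to conditions (3) and (4); with this choice the coder output has variances $\sigma_i^2$ on the diagonal of its covariance.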

The MSE $E$ will be minimized if bits are assigned according to (5), which says that

$$2^{-2b_i} \sigma_i^2 = M, \quad (6)$$

subject to $B = \sum_{i=1}^m b_i$. The solution to the bit assignment problem parallels the one for standard transform coding if we observe that the variance of $u_i$ is $\sigma_i^2$:

$$b_i = b + \frac{1}{2} \log_2 \frac{\sigma_i^2}{\left(\prod_{j=1}^m \sigma_j^2\right)^{1/m}}. \quad (7)$$

We can now express the MMSE in terms of $\sigma_i$ and $M$ as

$$\min_{A,B} E = \left(\operatorname{tr} R_{xx} - \sum_{i=1}^m \sigma_i^2\right) + c\,m\,M. \quad (8)$$

The first term in (8) accounts for the infinite-precision filtering error and the second term for the error due to quantization. If the bit rate $B$ is given, then we use (7) to assign bits and (8) tells us the resulting MSE. On the other hand, if we want to achieve a given MSE, we can use (8) to determine the required $M$, and then (6) to assign bits.

Note that (3) and (4) allow the transformations $A$ and $B$ to be scaled by a non-singular diagonal matrix. If $A$ is replaced by $D^{-1}A$ and $B$ by $BD$, then the quantizer input $u$ will still be uncorrelated and according to (5) the optimum bit assignment is left unchanged. For instance, we could choose $A = V_m^T R_{yy}^{-1/2}$ and $B = U\Sigma_m$, where $V_m$ contains the first $m$ columns of $V$. With this choice, the transformation $u = Ay$ takes $y$ into a half canonical coordinate system, where the white, unit-variance, half canonical coordinates $u$ are quantized. The transformation $B$ applies a diagonal Wiener filter $\Sigma_m$ in canonical coordinates to the quantizer output $\hat{u}$, and transforms the filter output back into the original coordinate system.


3. MIN-DET PROBLEM

In this section, we show that the right coordinate system for $u$ to solve the min-det problem is the system of full canonical coordinates [4]. The proof will again be based on the AWN model. We start the minimization of $V = \det R_{ee}$ by applying Minkowski’s determinant inequality to (1):

$$V \ge \left( \left(\det(Q + B R_{qq} B^T)\right)^{1/m} + \left(\det\left((W - BA) R_{yy} (W - BA)^T\right)\right)^{1/m} \right)^m.$$

Since both the term in the left $\det(\cdot)$ and the right $\det(\cdot)$ expression are positive semidefinite, we can minimize $V$ by making the second term zero, i.e., choosing $BA = W$. We will assume this optimum choice in what follows. Using Minkowski’s determinant inequality once more yields

$$V \ge \left( (\det Q)^{1/m} + \left(\det(B R_{qq} B^T)\right)^{1/m} \right)^m. \quad (9)$$

Since $Q$ does not depend on the choice of $A$ or $B$, we must minimize $\det(B R_{qq} B^T)$ in order to minimize the bound on $V$. Similar to the procedure in the previous section we have

$$\det(B R_{qq} B^T) = (c\,2^{-2b})^m \prod_{i=1}^m (A R_{yy} A^T)_{ii} \det(B^T B)$$

$$\ge (c\,2^{-2b})^m \det(A R_{yy} A^T) \det(B^T B) \quad (10)$$

$$= (c\,2^{-2b})^m \det(R_{xy} R_{yy}^{-1} R_{xy}^T).$$

Inequality (10) is again Hadamard’s inequality, which becomes an equality if $A R_{yy} A^T$ is diagonal. Minkowski’s inequality (9), on the other hand, becomes an equality if

$$Q = \kappa\, B R_{qq} B^T$$

for some $\kappa > 0$. With the knowledge that $A R_{yy} A^T$ must be diagonal for equality, this means

$$Q = c\kappa \cdot B\left[(A R_{yy} A^T) \operatorname{diag}(2^{-2b_1}, \ldots, 2^{-2b_m})\right] B^T. \quad (11)$$

The expression in square brackets is a diagonal matrix. Thus, in order to evaluate what (11) implies for $B$, we would like to factor $Q = T D_Q T^T$, where $D_Q$ is diagonal. To this end, we need the SVD of the coherence matrix [4]

$$R_{xx}^{-1/2} R_{xy} R_{yy}^{-T/2} = F K G^T = F \begin{bmatrix} K_m & 0_{m \times (n-m)} \end{bmatrix} G^T.$$

Then we can rewrite (11) as

$$Q = R_{xx}^{1/2} F (I - K K^T) F^T R_{xx}^{T/2} = c\kappa \cdot B\left[(A R_{yy} A^T) \operatorname{diag}(2^{-2b_1}, \ldots, 2^{-2b_m})\right] B^T. \quad (12)$$

It is apparent that if we choose $A = D^{-1} K G^T R_{yy}^{-1/2}$ and $B = R_{xx}^{1/2} F D$, where $D$ is any non-singular diagonal matrix, then $BA = W$, $A R_{yy} A^T$ is diagonal, and (12) is satisfied. For $D = K_m$, the transformation $u = Ay = G_m^T R_{yy}^{-1/2} y$, with $G_m$ the first $m$ columns of $G$, takes $y$ into the full canonical coordinate system, where the white, unit-variance, full canonical coordinates $u$ are quantized. The transformation $B = R_{xx}^{1/2} F K_m$ applies a diagonal Wiener filter $K_m$ in full canonical coordinates to the quantizer output, and the filter output is transformed back into the original coordinate system with $R_{xx}^{1/2} F$. The diagonal Wiener filter $K_m$ contains the full canonical correlations $k_i$ between $x$ and $y$ on its diagonal.

Equation (12) also determines the right bit assignment strategy. Defining $\gamma_i^2 = k_i^2/(1 - k_i^2)$, we must satisfy

$$2^{-2b_i} \gamma_i^2 = \frac{1}{c\kappa}, \quad i = 1, \ldots, m. \quad (13)$$

This means we have the same solution as for (6) with half canonical correlations $\sigma_i$ replaced by $\gamma_i$,

$$b_i = b + \frac{1}{2} \log_2 \frac{\gamma_i^2}{\left(\prod_{j=1}^m \gamma_j^2\right)^{1/m}}. \quad (14)$$

We can now express the minimum achievable value of $V$ in terms of $k_i$ and $\kappa$ as

$$\min_{A,B} V = \det(R_{xx}) \left(1 + \frac{1}{\kappa}\right)^m \prod_{i=1}^m (1 - k_i^2). \quad (15)$$

Similar to the MMSE quantizer, if the bit rate $B$ is given, then we use (14) to assign bits and (15) tells us the resulting $V$. On the other hand, if we want to achieve a given $V$, we can use (15) to determine the required $\kappa$, and then (13) to assign bits.

Notice that, just like the min-trace problem, the min-det problem can also be solved by first computing the linear MMSE estimate of $x$ as $\hat{x} = Wy$ and then quantizing $\hat{x}$. To see this, write $A = K G^T R_{yy}^{-1/2} = F^T R_{xx}^{-1/2} R_{xy} R_{yy}^{-1} = F^T R_{xx}^{-1/2} W$, and $B = R_{xx}^{1/2} F$. Therefore, this problem again may be separated into an estimation and a quantization problem, as depicted in Fig. 2 (d). Observe that the quantizer is a maximum information rate rather than an MMSE quantizer. It contains the coder $F^T R_{xx}^{-1/2}$ and the decoder $R_{xx}^{1/2} F$. Thus, even though coder and decoder are inverses of each other, they are non-orthogonal, unlike the min-trace case.

4. DIFFERENT REALIZATIONS

It has already been proved that some system of canonical coordinates is optimum for rank reduction, even without the need to invoke the AWN model. Again, half canonical coordinates minimize the trace [3, p. 330] and full canonical coordinates minimize the determinant of the error covariance matrix [1]. Together with our results, the important implications are these: Suppose we have a reduced rank Wiener filter, designed to either control MMSE or information rate. Then suppose this filter is to be quantized. The resulting reduced rank quantized structure retains the original coordinate system and replaces infinite precision internal coordinates with quantized coordinates. That is, the coordinate system does not change.

Fig. 2 displays several different implementations of reduced rank quantizers in full canonical coordinates. The quantizer $Q$ can assign zero bits to a number of components, which means that only an $r$-dimensional statistic is stored or transmitted in these implementations. Before reconstruction, this $r$-dimensional statistic is augmented with zeros to build again an $m$-dimensional vector, as indicated in the figure by the $r \to m$ building block.

Fig. 2. Reduced rank quantizers, full canonical coordinates

Each line of Fig. 2 is insightful. There are implementations where explicit estimation precedes quantization and vice versa. For example, line (c) shows the reduced rank quantizer to consist of whitening, analysis onto the basis for $\langle G_m \rangle$, quantizing, synthesis in the basis for $\langle G_m \rangle$, coloring, and filtering. In a storage or transmission application, only $\hat{u} = Q[G_m^T R_{yy}^{-1/2} y]$ would be stored or transmitted, and $W R_{yy}^{1/2} G_m$ would be computed at the receiver. Quantization in half canonical coordinates can be implemented very similarly, simply replacing $F K G^T$ with $U \Sigma V^T$ and $R_{xx}$ with the identity in Fig. 2.
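As an illustration of the reduced rank structure in full canonical coordinates, the sketch below keeps the $r$ strongest canonical coordinates at infinite precision and zero-pads the rest. It assumes symmetric matrix square roots and a hypothetical observation model used only to generate covariances; the function names are illustrative and quantization noise is deliberately left out.

```python
import numpy as np


def sqrt_pair(R):
    """Symmetric square root and inverse square root of a positive-definite matrix."""
    w, Q = np.linalg.eigh(R)
    return Q @ np.diag(np.sqrt(w)) @ Q.T, Q @ np.diag(w ** -0.5) @ Q.T


def full_canonical(Rxx, Rxy, Ryy):
    """Coder A = G_m^T Ryy^{-1/2}, decoder B = Rxx^{1/2} F K_m, correlations k."""
    Rxx_h, Rxx_ih = sqrt_pair(Rxx)
    _, Ryy_ih = sqrt_pair(Ryy)
    # SVD of the coherence matrix; singular values come sorted in descending order.
    F, k, Gt = np.linalg.svd(Rxx_ih @ Rxy @ Ryy_ih, full_matrices=False)
    return Gt @ Ryy_ih, Rxx_h @ F @ np.diag(k), k


def reduced_rank_error_cov(Rxx, Rxy, Ryy, r):
    """Error covariance when only the first r canonical coordinates are kept."""
    A, B, k = full_canonical(Rxx, Rxy, Ryy)
    keep = np.zeros(len(k))
    keep[:r] = 1.0                       # the r -> m zero-padding block
    T = B @ np.diag(keep) @ A            # reduced rank Wiener filter
    return Rxx - T @ Rxy.T - Rxy @ T.T + T @ Ryy @ T.T


rng = np.random.default_rng(1)
m, n, r = 3, 4, 2
H = rng.standard_normal((n, m))          # hypothetical observation matrix
Rxx = np.diag([2.0, 1.0, 0.5])
Rxy = Rxx @ H.T
Ryy = H @ Rxx @ H.T + np.eye(n)          # unit noise covariance (assumed)

A, B, k = full_canonical(Rxx, Rxy, Ryy)
Ree = reduced_rank_error_cov(Rxx, Rxy, Ryy, r)
print(np.allclose(A @ Ryy @ A.T, np.eye(m)))   # coordinates are white
print(np.linalg.det(Ree), np.linalg.det(Rxx) * np.prod(1 - k[:r] ** 2))
```

The two printed determinants agree, showing that discarding coordinates removes their factors $(1 - k_i^2)$ from the error volume while the coordinate system itself stays the same.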


5.  CONCLUDING  REMARKS 

Assuming  the  AWN  model  for  quantization,  we  have  proved 
that  the  right  coordinate  systems  for  quantization  are  the 
systems  of  half  and  full  canonical  coordinates.  Half  canon¬ 
ical  coordinates  minimize  the  trace  and  full  canonical  co¬ 
ordinates  minimize  the  determinant  of  the  error  covariance 
matrix.  It  has  been  proved  earlier  [3,  p.  330],  [1],  that 
canonical  coordinates  are  optimum  for  rank  reduction,  as 
well.  Together  with  our  results,  this  means  that  we  can  first 
choose  a  coordinate  system  and  then  decide  how  many  bits 
to  spend  on  how  many  components.  More  details  can  be 
found  in  the  journal  version  [5]  of  this  paper. 
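The bit-assignment rule (7), which gives each component the average rate plus half the base-2 logarithm of its variance relative to the geometric mean, can be sketched as follows. The function name and the example variances are assumptions for illustration; the sketch also ignores that the rule may return non-integer or negative bit counts, which a practical allocator would have to handle.

```python
import numpy as np


def assign_bits(variances, total_bits):
    """Bit assignment b_i = b + 0.5 * log2(variance_i / geometric mean of variances)."""
    v = np.asarray(variances, dtype=float)
    b_avg = total_bits / v.size
    geo_mean = np.exp(np.mean(np.log(v)))
    return b_avg + 0.5 * np.log2(v / geo_mean)


# Hypothetical squared half canonical correlations and a 12-bit budget.
sigma2 = np.array([4.0, 1.0, 0.25])
bits = assign_bits(sigma2, 12)
M = 2.0 ** (-2 * bits) * sigma2   # should be the same constant for every component

print(bits)
print(M)
```

Replacing the variances by $\gamma_i^2 = k_i^2/(1 - k_i^2)$ gives the corresponding assignment for the min-det problem.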
Abstract

We  study  the  problem  of  sensor-scheduling  for  target  tracking — to  determine  which  sen¬ 
sors  to  activate  over  time  to  trade  off  tracking  performance  with  sensor  usage  costs.  We 
approach  this  problem  by  formulating  it  as  a  partially  observable  Markov  decision  process 
(POMDP),  and  develop  a  Monte  Carlo  solution  method  using  a  combination  of  particle 
filtering  for  belief-state  estimation  and  sampling-based  Q-value  approximation  for  looka¬ 
head.  To  evaluate  the  effectiveness  of  our  approach,  we  consider  a  simple  sensor-scheduling 
problem  involving  multiple  sensors  for  tracking  a  single  target. 


1  Introduction 

One  of  the  key  problems  in  the  design  and  operation  of  modern  tracking  systems 
is  sensor  scheduling,  which  aims  to  improve  tracking  system  performance,  utilize  lim¬ 
ited  system  resources  more  effectively  and  efficiently,  and  offer  much  faster  adaptation 
to  changing  environments  [1].  The  basic  problem  is  to  select  which  sensors  to  acti¬ 
vate  for  target  tracking  over  time  to  trade  off  tracking  performance  with  sensor  usage 
costs. 
A number of papers have addressed the sensor-scheduling problem for different
applications.  In  [2-4],  this  problem  is  formulated  as  an  optimization  problem  to  min¬ 
imize  instantaneous  estimation  errors  and/or  maximize  information  gains.  In  such 
schemes,  however,  long-term  performance  is  not  considered,  which  leads  to  myopic 
sensor-scheduling  policies. 

To  incorporate  long-term  performance  measures,  the  sensor-scheduling  problem 
may  be  framed  as  a  stochastic  optimal  control  problem.  In  this  case,  a  partially  ob¬ 
servable  Markov  decision  process  (POMDP)  framework  is  a  natural  approach,  which 
is  able  to  address  both  short-term  and  long-term  benefits  and  costs  [5-8]. 

Within  the  POMDP  framework,  the  process  measured  by  sensors  (target  posi¬ 
tion,  velocity,  etc.)  is  a  Markov  process,  and  sensor  scheduling  is  based  on  recursively 
estimating  and  updating  the  belief  state,  the  posterior  distribution  of  the  process  given 
the  history  of  the  sensor  measurements  and  the  sensor-scheduling  actions.  In  some 
situations,  the  process  dynamics  and  measurements  can  be  represented  as  linear  Gaus¬ 
sian  state-space  models,  in  which  case  the  belief  state  can  be  calculated  analytically 
by  Kalman  filtering.  In  other  situations  [7,8],  the  process  dynamics  can  be  modeled 
as  a  partially  observable,  finite  state  Markov  chain,  and  it  is  also  feasible  to  obtain 
an  analytic  solution  for  the  belief  state  using  a  hidden  Markov  model  (HMM)  filter. 
In  practice,  however,  process  dynamics  and  observations  can  be  very  complicated — 
usually  nonlinear,  non-Gaussian,  and  high-dimensional — which  precludes  analytic  so¬ 
lutions. 

In  this  paper,  we  propose  to  explore  the  sensor-scheduling  problem  within  a 
POMDP  framework  but  without  relying  on  analytic  expressions  for  belief  states.  In¬ 
stead,  we  develop  a  Monte  Carlo  solution  approach  that  combines  particle  filtering  for 
non-Gaussian, nonlinear belief-state estimation, and a Q-value approximation method
for  solving  the  POMDP  via  “lookahead.”  Our  goal  is  to  design  a  policy  for  sensor 
scheduling  to  manage  (simultaneously)  tracking  performance  and  sensor  usage.  The 
Q-value  approximation  method  aims  to  deal  with  the  issue  that  the  state  space  for 
the  POMDP  model  can  be  very  large,  practically  ruling  out  the  use  of  methods  that 
rely  on  direct  reasoning  with  the  state  space  in  computing  an  optimal  policy. 

Particle  filtering  is  a  promising  Monte  Carlo  method  for  posterior-distribution 
estimation,  working  with  random  samples  drawn  from  the  process  distribution.  The 
Q-value  approximation  method  involves  computing,  for  each  candidate  action  to  be 
taken, a value of the “cost” of that action (the Q-value) and selecting the action with
minimum  cost.  Our  approach  blends  the  two  separate  techniques  in  a  natural  way.  The 
particle  filter  provides  the  Q-value  approximation  method  a  set  of  states  (particles) 
to  initiate  the  evaluation  processes,  and  in  return  the  Q-value  approximation  method 
delivers  updated  actions  for  the  particle  filtering  belief-state  estimation. 

Our  approach  benefits  from  several  appealing  features.  First,  it  can  take  both 
long-term  and  short-term  costs  and  benefits  into  account.  Second,  because  it  is  a 
Monte  Carlo  method,  which  does  not  rely  on  analytical  tractability,  it  is  straightfor¬ 
ward  in  our  approach  to  incorporate  sophisticated  models  for  sensor  behavior  and 
target dynamics. In particular, the model we introduce in Section 3 includes a nonlinear observation map and sensors with blind zones.

Our  main  contribution  here  is  to  combine  POMDPs  with  particle  filtering  for 
sensor  scheduling.  The  formulation  of  the  sensor-scheduling  problem  as  a  POMDP  is 
itself  not  new  (see,  e.g.,  [7,8]).  However,  our  use  of  Monte  Carlo  sampling  methods  for 
Q-value approximation is new in the sensor-scheduling context. Moreover, in previous
work,  the  belief  state  was  estimated  and  updated  using  either  Kalman  filtering  or 
HMM  filtering  (e.g.  [7,8]).  Both  Kalman  filtering  and  HMM  filtering  are  inadequate 
for  nonlinear,  non-Gaussian  state  estimation.  Recently,  the  authors  of  [4,9,10]  have 
also  studied  the  use  of  particle  filtering  for  sensor-scheduling.  However,  in  [4,9,10]  the 
problem  was  not  formulated  as  a  POMDP. 

To evaluate the effectiveness of our approach, we study a simple sensor-scheduling problem involving multiple sensors for tracking a single target. In particular, we explore the tradeoff between tracking performance and sensor usage costs. Our simulation results demonstrate that our method of combining particle filtering with Q-value approximation is effective in calculating a sensor-scheduling policy that systematically allows trading off tracking performance for sensor usage costs.


2  Preliminaries 

We begin with a brief description of POMDPs; we follow the treatment in [11]. A POMDP is specified by state space $S$, action space $U$, observation space $Z$, state transition law $K(s'|s,u)$ ($s \in S$ and $u \in U$), observation map $L(z|u,s)$ ($z \in Z$), initial state distribution $p_0$, and one-step cost function $g(s,u)$. The POMDP generates a sequence of states that evolves as follows. At time $k = 0$, the system starts at the initial (unobservable) state $s_0$ with the given initial distribution $p_0$. If at time $k$, the state of the system is $s_k$ and control $u_k$ (chosen from a set of available actions $U(s_k)$) is applied, a cost $g(s_k,u_k)$ is incurred and the system moves to state $s_{k+1}$ according to the transition law $K(s_{k+1}|s_k,u_k)$; observation $z_{k+1}$ is generated according to the observation map $L(z_{k+1}|u_k,s_{k+1})$.

A policy for the POMDP can be defined as a sequence $\pi = \{\mu_k(p(s_k|I_k))\}$
such that, for each $k$, $\mu_k(p(s_k|I_k))$ is a state-feedback map that specifies an
action $u_k$ on $U$ depending on the belief state $p(s_k|I_k)$, the posterior probability
distribution of state $s_k$ conditioned on the observable history $I_k$ ($I_0 := (p_0)$ and
$I_k := (p_0, u_0, z_1, \ldots, z_{k-1}, u_{k-1}, z_k)$ for $k \geq 1$). We can track belief states in a
POMDP using Bayes’ rule.
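
As an illustration of how a belief state can be tracked with Bayes' rule, the following Python sketch assumes finite state, action, and observation spaces, which is a simplification (the tracking problem considered later has continuous states). The function name `belief_update`, the nested-list layout of `K` and `L`, and all numbers are illustrative assumptions and are not taken from [11].

```python
def belief_update(belief, u, z, K, L):
    """One Bayes-rule update of a belief state for a finite POMDP.

    belief[s]   : current p(s_k = s | I_k)
    K[u][s][s2] : transition law K(s2 | s, u)
    L[u][s2][z] : observation map L(z | u, s2)
    Returns p(s_{k+1} | I_{k+1}) after applying u and observing z.
    """
    n = len(belief)
    # Prediction: push the current belief through the transition law.
    predicted = [sum(belief[s] * K[u][s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correction: weight each next state by the likelihood of z.
    unnormalized = [predicted[s2] * L[u][s2][z] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [w / total for w in unnormalized]


# Toy model with two states, one action, two observation symbols
# (all numbers are made up for illustration).
K = [[[0.9, 0.1], [0.2, 0.8]]]
L = [[[0.7, 0.3], [0.1, 0.9]]]
b1 = belief_update([0.5, 0.5], u=0, z=1, K=K, L=L)
```

In this toy case the observation `z = 1` is more likely in the second state, so the updated belief shifts toward that state.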

Let $J_H(p(s_0|I_0)) = E\left(\sum_{k=0}^{H-1} \int g(s_k,u_k)\,dp(s_k|I_k)\right) = E\left(\sum_{k=0}^{H-1} g(s_k,u_k)\right)$ be the
expected total cost over a horizon of $H$ time steps, where the expectation is taken
over all possible belief-state sequences. Our objective is to find a policy $\pi^* =
\{\mu_0^*(p(s_0|I_0)), \mu_1^*(p(s_1|I_1)), \ldots\}$ that minimizes $J_H(p(s_0|I_0))$. We denote the
associated optimal value (a function of the initial belief state) by $J_H^*(p(s_0|I_0))$. We write
$Q_H(p(s|I),u) = \int g(s,u)\,d(p(s|I)) + E[J_{H-1}^*(p(s'|I'))]$ (called the Q-value), where
$J_{H-1}^*(p(s'|I'))$ is the optimal value over $H-1$ time steps starting at the “next”
belief state $p(s'|I')$. It turns out that an optimal policy satisfies $\mu_k^*(p(s_k|I_k)) =
\arg\min_{u_k \in U(s_k)} Q_{H-k}(p(s_k|I_k),u_k)$, where $J_H^*(p(s_0|I_0)) = \min_{u \in U(s_0)} Q_H(p(s_0|I_0),u)$
(also called Bellman’s optimality equation for POMDP); see [11] for further details.

We assume $H$ to be very large, so that the optimal policy can be assumed to
be stationary. In this case, the optimal policy is approximated by assuming, at each
time, that the remaining horizon is $H$ steps, so that the optimal action at time $k$ can
be taken to be $u_k^* = \arg\min_{u \in U(s_k)} Q_H(p(s_k|I_k),u)$. This approach is called receding
horizon control, common in the optimal control literature (e.g., [12]). Note that the
resulting control law is simply this: given the current state, choose the action that
minimizes the Q-value at that state. Because the Q-value of an action summarizes the
future impact of taking that action, our control approach is also called “lookahead.”
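
To make the lookahead rule concrete, the sketch below computes Q-values by the finite-horizon recursion and then picks the minimizing action. It is only a sketch under a strong simplifying assumption: the belief state is replaced by a fully observed state of a small finite MDP, so that the recursion can be written in a few lines. The names `q_values` and `lookahead_action` and the toy numbers are hypothetical.

```python
def q_values(state, horizon, K, g, actions):
    """Return {u: Q_H(state, u)} by the finite-horizon recursion.

    K[u][s][s2] : transition probability to s2 from s under action u
    g[s][u]     : one-step cost
    """
    q = {}
    for u in actions:
        future = 0.0
        if horizon > 1:
            for s2, prob in enumerate(K[u][state]):
                if prob > 0.0:
                    # Optimal cost-to-go over the remaining horizon - 1 steps.
                    future += prob * min(
                        q_values(s2, horizon - 1, K, g, actions).values())
        q[u] = g[state][u] + future
    return q


def lookahead_action(state, horizon, K, g, actions):
    """Receding-horizon rule: pick the action with the smallest Q-value."""
    q = q_values(state, horizon, K, g, actions)
    return min(q, key=q.get)


# Toy problem: action 0 is free but may lead to a costly state 1;
# action 1 costs one unit and always returns to state 0.
K = [[[0.5, 0.5], [0.0, 1.0]],   # action 0
     [[1.0, 0.0], [1.0, 0.0]]]   # action 1
g = [[0.0, 1.0],                 # costs in state 0
     [5.0, 6.0]]                 # costs in state 1
myopic = lookahead_action(0, 1, K, g, [0, 1])
two_step = lookahead_action(0, 2, K, g, [0, 1])
```

With a one-step horizon the free action looks best, whereas a two-step horizon already accounts for the risk of entering the costly state and selects the other action.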


3  Sensor-Scheduling for Target-Tracking Problem Formulation

In our context, there are many sensors distributed in a sensor field to track
targets, and a global processor processing data from all sensors. For homogeneous sensors,
the tracking accuracy can be improved through data fusion of multiple sensors. However,
the larger the number of sensors, the more resources they consume. Therefore it
is necessary to select an appropriate number of sensors to balance tracking
accuracy against sensor usage. For heterogeneous sensors, the sensor-usage characteristics
and/or the quality of the data transmitted to the global processor differ from sensor to sensor.
In this case, data from one sensor can be used to complement the data from other
sensors in order to obtain broader coverage and more accurate target-state estimates.
Selecting the sensor combination that achieves the desired tradeoff between
tracking accuracy and sensor usage is a key task of sensor scheduling.

We now describe a formulation of the sensor-scheduling problem within a
POMDP framework. Although our approach is fairly general, for ease of presentation
we make some simplifying assumptions:

•  We  only  track  a  single  target; 

•  The  target  states  to  be  tracked  consist  of  its  two-dimensional  position  and  velocity; 

•  There  are  M  sensors  located  at  fixed  positions  to  measure  the  following  parameters: 
range,  range  rate,  and  azimuth  of  the  target; 

•  At  each  time  step,  only  one  sensor  is  selected  (activated). 

We consider an aggregate tracking system state vector $s_k = [t_k, a_k]^T$, where $t_k$ and
$a_k$ are summaries for the target features and the sensor operations, respectively, in the
tracking system sufficient to characterize the objectives and potential actions. Specifically,

\[
s_k = [\underbrace{x_k, \dot{x}_k, y_k, \dot{y}_k}_{t_k}, \underbrace{a_{k,1}, \ldots, a_{k,M}}_{a_k}]^T,
\]

where $x_k$ and $y_k$ are the target-position Cartesian coordinates, $\dot{x}_k$ and $\dot{y}_k$ are
velocities, and $a_{k,m} \in \{0,1\}$ is the activity status of sensor $m$, $m = 1, \ldots, M$.
The action space $U$ is $\{1, \ldots, M\}$, and action $u_k \in \{1, \ldots, M\}$ represents the sensor
selected at time $k$. The observation at time $k$ is $z_k = [d_k, r_k, \theta_k, \dot{r}_k]^T$, where
$d_k \in \{0,1\}$ represents successful detection, and $r_k$, $\theta_k$, and $\dot{r}_k$ are range, azimuth,
and range-rate measurements of the target using the selected sensor $u_k$ at time $k$
(if $d_k = 0$, then $r_k$, $\theta_k$, and $\dot{r}_k$ are ignored).

In our formulation, the transition law $K(s'|s,u)$ and observation map $L(z|u,s)$
are defined by the state equation and the observation equation as follows:

\[ s_{k+1} = f(s_k, u_k, \nu_k), \tag{1} \]
\[ z_k = h(s_k, \omega_k), \tag{2} \]

where $f$ and $h$ represent the state dynamics and the observation map, respectively, and
$\nu_k$ and $\omega_k$ represent the randomness in state transitions and observations, respectively.
We assume that $\{\nu_k\}$ and $\{\omega_k\}$ are mutually independent i.i.d. random variables with
distributions $p_\nu$ and $p_\omega$. Then, $K(ds'|s,u) = \int 1_{ds'}(f(s,u,\nu'))\,p_\nu(d\nu')$ and
$L(dz|u,s) = \int 1_{dz}(h(s,\omega'))\,p_\omega(d\omega')$ ($1_A$ represents the indicator function of $A$).

We first describe the state dynamics $f$. Because the state vector $s_k$ is composed
of two segments, the state dynamics can be decomposed in the following way:
$f(s_k, u_k, \nu_k) = [f^t(t_k, \nu_k^t), f^a(u_k)]^T$, where $\nu_k^t$ represents target motion uncertainties.
The form of $f^a(u_k)$ is clear: all its components are 0 except for the component
corresponding to the selected sensor $u_k$, where it is 1. The specific form of $f^t$ represents
the model for the motion of the target. As an example, the particular model used in
our simulation experiments is as follows (taken from [13]):

\[
t_{k+1} =
\begin{bmatrix} x_{k+1} \\ \dot{x}_{k+1} \\ y_{k+1} \\ \dot{y}_{k+1} \end{bmatrix}
= f^t(t_k, \nu_k^t) =
\begin{bmatrix} 1 & T_s & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & T_s \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x_k \\ \dot{x}_k \\ y_k \\ \dot{y}_k \end{bmatrix}
+
\begin{bmatrix} T_s^2/2 & 0 \\ T_s & 0 \\ 0 & T_s^2/2 \\ 0 & T_s \end{bmatrix}
\begin{bmatrix} \nu_k^x \\ \nu_k^y \end{bmatrix},
\tag{3}
\]

where $T_s$ is the sampling interval (assumed constant), and $\nu_k^x$ and $\nu_k^y$ are independent
noise processes with zero mean and variances $\sigma_x^2$ and $\sigma_y^2$.
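
The target-motion model of equation (3) can be sketched in a few lines of Python. The function name `target_step` is hypothetical, and drawing the noise from a Gaussian distribution is an assumption made here for illustration; the text only specifies zero mean and the two variances.

```python
import random


def target_step(t, Ts, sigma_x, sigma_y, rng):
    """Propagate t = (x, vx, y, vy) by one sampling interval Ts.

    The same noise sample drives position (through Ts^2 / 2) and
    velocity (through Ts) in each coordinate, as in equation (3).
    """
    x, vx, y, vy = t
    nu_x = rng.gauss(0.0, sigma_x)   # Gaussian noise is an assumption
    nu_y = rng.gauss(0.0, sigma_y)
    return (x + Ts * vx + 0.5 * Ts ** 2 * nu_x,
            vx + Ts * nu_x,
            y + Ts * vy + 0.5 * Ts ** 2 * nu_y,
            vy + Ts * nu_y)


# With zero noise the target simply moves at constant velocity.
rng = random.Random(0)
t_next = target_step((0.0, 1.0, 0.0, 2.0), 2.0, 0.0, 0.0, rng)
```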

Next, we describe the observation map $h$, which represents how the sensor
measurements depend on the state. The particular form of $h$ depends on the type of
sensors being considered. For example, in our simulation experiments, we follow the
radar model of [13], where each sensor has a Doppler blind zone (as is the case with
a CW or pulse Doppler radar). The probability of detection according to this model
is (for a particular sensor $m$):

\[
\begin{cases}
P_d(m), & \text{if } \dfrac{(x_k - sp_x(m))\,\dot{x}_k + (y_k - sp_y(m))\,\dot{y}_k}{\sqrt{(x_k - sp_x(m))^2 + (y_k - sp_y(m))^2}} > B_0(m), \\[2ex]
0, & \text{otherwise,}
\end{cases}
\tag{4}
\]

where $P_d(m) \in (0,1]$, $B_0(m)$ is the limit of the Doppler blind zone for sensor $m$, and
$sp_x(m)$ and $sp_y(m)$ are the Cartesian coordinates of the fixed position of sensor $m$.

For this example, the observation map is given by

\[
z_k = h(s_k, \omega_k) =
\begin{bmatrix}
h^d(s_k, \omega_k^d) \\[1ex]
\sqrt{(x_k - sp_x(m(a_k)))^2 + (y_k - sp_y(m(a_k)))^2} + \omega_k^r(m(a_k)) \\[1ex]
\text{azimuth of the target from sensor } m(a_k) \;+\; \omega_k^\theta(m(a_k)) \\[1ex]
\dfrac{(x_k - sp_x(m(a_k)))\,\dot{x}_k + (y_k - sp_y(m(a_k)))\,\dot{y}_k}{\sqrt{(x_k - sp_x(m(a_k)))^2 + (y_k - sp_y(m(a_k)))^2}} + \omega_k^{\dot{r}}(m(a_k))
\end{bmatrix},
\tag{5}
\]

where $h^d(s_k, \omega_k^d)$ is given by

\[
h^d(s_k, \omega_k^d) =
\begin{cases}
1_{\{\omega_k^d < P_d(m(a_k))\}}, & \text{if } \dfrac{(x_k - sp_x(m(a_k)))\,\dot{x}_k + (y_k - sp_y(m(a_k)))\,\dot{y}_k}{\sqrt{(x_k - sp_x(m(a_k)))^2 + (y_k - sp_y(m(a_k)))^2}} > B_0(m(a_k)), \\[2ex]
0, & \text{otherwise,}
\end{cases}
\tag{6}
\]

$m(a_k)$ is the currently selected sensor, $\omega_k^d$ is uniformly distributed over $(0,1)$, and
$\omega_k^r(m)$, $\omega_k^\theta(m)$, and $\omega_k^{\dot{r}}(m)$ represent independent observation noise processes with
zero mean and variances $\sigma_r^2(m)$, $\sigma_\theta^2(m)$, and $\sigma_{\dot{r}}^2(m)$, respectively, $m = 1, \ldots, M$.
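
A minimal Python sketch of drawing one observation from this sensor model follows. Several choices in it are assumptions rather than part of the model above: the function name `observe`, the Gaussian form of the measurement noise, and the azimuth convention (angle of the sensor-to-target vector measured from the x-axis via `atan2`). The blind-zone test compares the range rate with `B0` exactly as written in equations (4) and (6).

```python
import math
import random


def observe(t, sensor_pos, Pd, B0, sigma_r, sigma_theta, sigma_rdot, rng):
    """Draw z = (d, r, theta, rdot) for target t = (x, vx, y, vy).

    Units must be consistent across all arguments; the target is assumed
    not to coincide with the sensor (the range would be zero).
    """
    x, vx, y, vy = t
    dx, dy = x - sensor_pos[0], y - sensor_pos[1]
    true_range = math.hypot(dx, dy)
    true_rdot = (dx * vx + dy * vy) / true_range
    # Detection: outside the blind zone, a uniform draw falls below Pd
    # with probability Pd, matching the detection probability in (4).
    detected = true_rdot > B0 and rng.random() < Pd
    if not detected:
        return (0, None, None, None)   # d = 0: the other entries are ignored
    r = true_range + rng.gauss(0.0, sigma_r)
    # Azimuth convention (angle from the x-axis) is an assumption.
    theta = math.atan2(dy, dx) + rng.gauss(0.0, sigma_theta)
    rdot = true_rdot + rng.gauss(0.0, sigma_rdot)
    return (1, r, theta, rdot)


rng = random.Random(0)
target = (3.0, 40.0, 4.0, 0.0)   # range 5, range rate 24 from the origin
z_seen = observe(target, (0.0, 0.0), 1.0, 20.0, 0.0, 0.0, 0.0, rng)
z_blind = observe(target, (0.0, 0.0), 1.0, 30.0, 0.0, 0.0, 0.0, rng)
```

In the second call the range rate of 24 lies below the blind-zone limit of 30, so no detection is reported regardless of `Pd`.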

The one-step cost function $g(s_k,u_k)$ is an integrated metric that accounts for
the target-tracking performance and the sensor-usage costs at time step $k$. As an
example, in our simulation experiments we use

\[
g(s_k,u_k) = E\,\|t_k - \hat{t}_k\|^2 + \sum_{m=1}^{M} \left( c_m^{\mathrm{usage}} \cdot 1_{\{u_k = m\}} + c_m^{\mathrm{start}} \cdot (a_{k,m} - 1_{\{u_k = m\}})^2 \right),
\tag{7}
\]

where $\hat{t}_k$ is the estimated value of state segment $t_k$ (we use the mean of $p(t_k|I_k)$ as
$\hat{t}_k$), and $c_m^{\mathrm{usage}}$ and $c_m^{\mathrm{start}}$ are the unit power-consumption cost and the unit sensor
start-up power-consumption cost for sensor $m$, respectively. The parameters $c_m^{\mathrm{usage}}$
and $c_m^{\mathrm{start}}$, which control the tradeoff between tracking error and sensor usage costs,
are assumed to be user-specified. Certainly one can imagine setting up a system to
tune such parameters on-line, for example in response to measurements. However, the
criterion that drives this kind of tuning must then be specified; again, we consider such
criteria to be user-specified. But then this scenario falls right back into our original
framework, except with a more complicated objective function.
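
The structure of the cost in equation (7) can be illustrated with the short Python sketch below. It assumes that the expected squared tracking error has already been computed (for example, from the particle cloud) and is passed in as a number; the function name `one_step_cost` and the cost values are hypothetical.

```python
def one_step_cost(tracking_error_sq, a_k, u_k, c_usage, c_start):
    """Tracking error plus usage and start-up costs, as in equation (7).

    tracking_error_sq : value of E||t_k - t_hat_k||^2 (computed elsewhere)
    a_k               : list of 0/1 activity flags, a_k[m-1] for sensor m
    u_k               : selected sensor, an integer in 1..M
    c_usage, c_start  : per-sensor unit costs, indexed like a_k
    """
    cost = tracking_error_sq
    for m in range(1, len(a_k) + 1):
        selected = 1 if u_k == m else 0
        cost += (c_usage[m - 1] * selected
                 + c_start[m - 1] * (a_k[m - 1] - selected) ** 2)
    return cost


# Sensor 1 is currently active; sensor 3 is expensive to run.
a_k = [1, 0, 0, 0]
c_usage = [1.0, 1.0, 10.0, 1.0]
c_start = [2.0, 2.0, 2.0, 2.0]
stay = one_step_cost(4.0, a_k, 1, c_usage, c_start)
switch = one_step_cost(4.0, a_k, 3, c_usage, c_start)
```

Note that the squared term charges the start-up cost for every sensor whose activity status differs from the new selection, so switching from sensor 1 to sensor 3 incurs it twice in this example.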

So far we have defined the state vector $s_k$, the action $u_k$, the observation vector
$z_k$, the state transition law $K(s'|s,u)$, the observation map $L(z|u,s)$, and the one-step
cost $g(s_k,u_k)$ for the sensor-scheduling POMDP model. Next we show how to obtain
an approximate optimal policy to schedule sensors for target tracking.


4  POMDP Solution via a Combination of Particle Filtering and Q-value Approximation

In this section, we present our control approach based on “lookahead” for solving
the sensor-scheduling POMDP, which leads to an approximate optimal policy. The
policy specifies, for each possible belief state, the (approximate) best sensor-activating
action to implement according to the objective function, where the belief state here is
the posterior distribution of the tracking system state conditioned on the observable
history at each time.

Recall from Section 2 that in the “lookahead” approach, the action is chosen at
each decision time by minimizing the Q-value for a moving horizon into the future.
To be precise, $u_k^* = \arg\min_{u \in U(s_k)} Q_H(p(s_k|I_k),u)$, where $p(s_k|I_k)$ is the current
belief state.

Note that the action $u_k^*$ to be chosen depends on belief state $p(s_k|I_k)$. Under
certain circumstances, analytical expressions of belief states can be derived. For
instance, if the observable history $I_k$ is linear-Gaussian with respect to the tracking
system state $s_k$, we can derive an analytical expression to recursively estimate the
posterior distribution with a Kalman filter. Alternatively, if the tracking system state
can be modeled as a Markov chain, as in [7,8], an HMM filter can be used for analytical
belief-state estimation. In practice, however, the relationship between the tracking
system state and the observable history can be very complex (usually nonlinear,
non-Gaussian, and high dimensional), which makes it impossible to obtain an analytical
solution.

To overcome this difficulty, we describe a novel general approach that combines
two techniques: particle filtering for belief-state estimation and Q-value approximation,
in which we represent the belief state by a cloud of particles. The Q-value
approximation method addresses the issue that the state space in practice can be
very large (especially in light of the need to represent a belief state), precluding the
use of methods that rely on direct reasoning with the state space in computing an
optimal policy.

4.1  Particle Filtering for Belief-State Estimation

Particle filtering is a sequential Monte Carlo method for on-line learning within
a Bayesian framework [14]. The method works with random samples drawn from the
underlying distribution, and is computationally realizable even for high-dimensional
problems. Particle filtering allows the use of realistic models, incorporation of a priori
information, and integration with decision processes.

In most particle-filtering formulations [14], the state equation, observation
equation, and the initial state probability are described by $s_{k+1} = f(s_k, \nu_k)$, $z_k = h(s_k, \omega_k)$,
and $p(s_0) = p_0$, respectively, and the goal is to estimate recursively the posterior
distribution $p(s_k|z_1, z_2, \ldots, z_k)$. However, in our sensor-scheduling problem, we have a
control variable $u_k$ in the state equation (1), and our goal is to estimate the
posterior distribution $p(s_k|I_k) = p(s_k|p_0, u_0, z_1, \ldots, z_{k-1}, u_{k-1}, z_k)$. Particle filtering with
control variables has been discussed recently in [15], though not within a POMDP
framework.

We can write the approximation of the posterior distribution $p(s_k|I_k)$ by a set of
samples or particles: $p^N(ds_k|I_k) = \sum_{i=1}^{N} w_k^{(i)} \delta_{s_k^{(i)}}(ds_k)$, where $\delta_{s_k^{(i)}}$ denotes the
Dirac-delta mass located at $s_k^{(i)}$, $N$ is the number of particles, and $w_k^{(i)}$ are the normalized
“importance” weights.

Following [14], our particle filtering algorithm to estimate $p(s_k|I_k)$ is as follows:

(1)  Initialization, $k = 0$.

•  For $i = 1, \ldots, N$, sample $s_0^{(i)} \sim p(s_0)$, set $w_0^{(i)} = 1/N$, and set $k = 1$.

(2)  Importance-sampling step.

•  For $i = 1, \ldots, N$, sample $\tilde{s}_k^{(i)} \sim q(s_k|s_{k-1}^{(i)}, I_k)$, where $q(s_k|s_{k-1}, I_k)$ is a
preselected “proposal” function.

•  For $i = 1, \ldots, N$, update the importance weights

\[
\tilde{w}_k^{(i)} = w_{k-1}^{(i)} \, \frac{K(\tilde{s}_k^{(i)}|s_{k-1}^{(i)}, u_{k-1}) \, L(z_k|u_{k-1}, \tilde{s}_k^{(i)})}{q(\tilde{s}_k^{(i)}|s_{k-1}^{(i)}, I_k)}.
\]

•  Normalize the importance weights according to $w_k^{(i)} = \tilde{w}_k^{(i)} \big/ \sum_{j=1}^{N} \tilde{w}_k^{(j)}$.

(3)  Selection step.

•  Resample with replacement $N$ particles ($s_k^{(i)}$; $i = 1, \ldots, N$) from the set ($\tilde{s}_k^{(i)}$;
$i = 1, \ldots, N$) according to the normalized importance weights.

•  Set $k \leftarrow k + 1$ and go to step 2.

Note that the main difference between our algorithm and the standard algorithm is
that the importance-sampling step in our algorithm involves the probability
distribution of observation conditioned on the action.
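
One pass through the importance-sampling and selection steps can be sketched as follows. The sketch assumes that the proposal is taken to be the transition law itself and that the previous weights are uniform (as they are after resampling), in which case the weight update reduces to the observation likelihood. The function name `particle_filter_step`, the callback signatures, and the one-dimensional toy model are all hypothetical.

```python
import math
import random


def particle_filter_step(particles, u_prev, z, sample_transition,
                         likelihood, rng):
    """One importance-sampling + selection step with proposal q = K.

    sample_transition(s, u, rng) draws a next state from K(. | s, u)
    likelihood(z, u, s) evaluates L(z | u, s)
    """
    # Importance sampling: propagate each particle under the applied action.
    proposed = [sample_transition(s, u_prev, rng) for s in particles]
    # With q = K and uniform previous weights, the weight is L(z | u, s).
    weights = [likelihood(z, u_prev, s) for s in proposed]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all particles have zero likelihood")
    weights = [w / total for w in weights]
    # Selection: resample N particles with replacement.
    return rng.choices(proposed, weights=weights, k=len(proposed))


# Toy scalar model: the action shifts the state; observations are noisy.
def toy_transition(s, u, rng):
    return s + u + rng.gauss(0.0, 1.0)


def toy_likelihood(z, u, s):
    return math.exp(-0.5 * (z - s) ** 2 / 0.25)


rng = random.Random(1)
cloud = particle_filter_step([0.0] * 500, 1.0, 3.0,
                             toy_transition, toy_likelihood, rng)
cloud_mean = sum(cloud) / len(cloud)
```

In the toy run the action alone would place the particles around 1, and the observation at 3 pulls the resampled cloud toward a value in between.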

In the importance-sampling step, we often choose either $p(s_k|s_{0:k-1}, I_k)$ or
$p(s_k|s_{k-1}, u_{k-1})$ as the “proposal” function. In the selection step, many schemes have
been proposed, such as residual sampling, systematic sampling, and Markov chain
Monte Carlo (MCMC) [14].
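
As an example of one such selection scheme, the sketch below implements systematic resampling in its common textbook form: a single uniform draw places N evenly spaced points on the cumulative weight distribution. The function name `systematic_resample` is hypothetical, and this is not necessarily the variant used in our experiments.

```python
import random


def systematic_resample(weights, rng):
    """Return N particle indices chosen by systematic resampling.

    weights must be normalized importance weights (summing to one).
    """
    n = len(weights)
    start = rng.random() / n          # single uniform draw in [0, 1/N)
    indices = []
    cumulative = weights[0]
    i = 0
    for j in range(n):
        point = start + j / n         # evenly spaced sampling points
        # Advance to the particle whose cumulative weight covers the point.
        while point > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices


rng = random.Random(0)
concentrated = systematic_resample([0.5, 0.5, 0.0, 0.0], rng)
uniform = systematic_resample([0.25, 0.25, 0.25, 0.25], rng)
```

Compared with independent draws, this scheme keeps every particle whose weight is at least 1/N and needs only one random number per resampling step.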

For  the  example  in  our  simulation  experiments,  we  use  a  special  particle  filter 
algorithm  for  belief  state  estimation  that  exploits  a  priori  knowledge  of  the  sensor 
blind  zones.  The  basic  idea  is  to  have  one  set  of  particles  for  sensor  blind  zones,  and 
the  other  set  of  particles  for  the  target.  Though  this  idea  is  similar  to  that  in  [13], 
our  particle  filter  is  designed  for  belief-state  estimation  rather  than  ordinary  posterior 
distribution  estimation.  We  omit  the  details  for  brevity. 


4.2  Q-value Approximation with Particle Filtering


Recall that according to the “lookahead” procedure, the action at time $k$ is chosen
as $u_k^* = \arg\min_{u \in U(s_k)} Q_H(p(s_k|I_k),u)$. We now describe how we approximate the
Q-values for the candidate actions. The need to approximate the Q-values stems
from the intractability of computing exact Q-values over the very large
state space encountered in practice.

Several Q-value approximation methods have been proposed for large state-space
MDPs [16,17]. Here we consider one particular Q-value approximation method: policy
rollout [16]. In the policy-rollout method, we estimate the Q-value for each belief
state and each action by averaging the evaluated accumulated costs from several
Monte Carlo simulation runs using a given base policy. This approximation gives us
an upper bound on the true Q-value.

Because we represent the belief state $p(s_k|I_k)$ using a cloud of particles, our
Q-value approximation method can take advantage of this representation in initiating
Monte Carlo simulation runs of the base policy. Specifically, we start each simulation
run at one particle for the belief state, apply action $u$ for the first time step, and
apply a given base policy $\pi_b$ for the remaining time steps. This allows us to generate
$N$ simulation runs, one for each of the $N$ different particles. We estimate the
Q-value for each belief state and each action by averaging the evaluated accumulated
costs from these $N$ Monte Carlo simulation runs. The resulting rollout policy is the
action minimizing

\[
Q_H(p^N(s_k|I_k), u) = \frac{1}{N} \sum_{i=1}^{N} \left\{ g(s_k^{(i)}, u) + \tilde{J}_{H-1}(s_{k+1}^{(i)}) \right\},
\]

where $s_{k+1}^{(i)}$ is the state after the first time step (for particle $i$) and $\tilde{J}_{H-1}$ is the
accumulated cost of the base policy over the remaining $H-1$ steps of that run.
Usually, we choose as the base policy a heuristic policy that is known to be reasonable.
The choice of a base policy may have a significant impact on the performance of the
rollout policy. For more details on how to choose a base policy properly, see [16].
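
The rollout estimate described above can be sketched as follows: one simulation run is started from each particle, the candidate action is applied first, and a base policy is followed afterwards. To keep the sketch short, the base policy is assumed to act on the simulated state rather than on a belief state, which is a simplification. The names `rollout_q_value` and `rollout_action` and the toy dynamics are hypothetical.

```python
import random


def rollout_q_value(particles, u, horizon, step, cost, base_policy, rng):
    """Average accumulated cost over one simulated run per particle.

    step(s, a, rng) samples the next state; cost(s, a) is the one-step cost.
    """
    total = 0.0
    for s in particles:
        run_cost = cost(s, u)          # first step: candidate action u
        s = step(s, u, rng)
        for _ in range(horizon - 1):   # remaining steps: base policy
            a = base_policy(s)
            run_cost += cost(s, a)
            s = step(s, a, rng)
        total += run_cost
    return total / len(particles)


def rollout_action(particles, actions, horizon, step, cost, base_policy, rng):
    """Pick the action with the smallest estimated Q-value."""
    q = {u: rollout_q_value(particles, u, horizon, step, cost,
                            base_policy, rng) for u in actions}
    return min(q, key=q.get)


# Deterministic toy problem: the state is an integer position, the cost
# penalizes distance from the origin, and the base policy stays put.
def toy_step(s, a, rng):
    return s + a


def toy_cost(s, a):
    return float(s * s)


def stay_put(s):
    return 0


rng = random.Random(0)
q_left = rollout_q_value([2, 2], -1, 3, toy_step, toy_cost, stay_put, rng)
best = rollout_action([2, 2], [-1, 0, 1], 3, toy_step, toy_cost, stay_put, rng)
```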

4.3  On the Computational Burden of Our Approach

A primary concern in applying sophisticated methods, such as ours, is the
computational burden involved. It is instructive to compare the computational burden of
our scheme with that of conventional myopic schemes, such as CPA (Closest Point of
Approach), which we will use in our simulation experiments for comparison. As the
basis of this comparison, we first note that the computational requirements in our
approach stem from three sources:

(1)  the  particle-filter  algorithm  for  belief-state  estimation, 

(2)  the  selection  of  an  action  with  minimum  Q-value,  and 

(3)  simulation  runs  for  Q-value  calculations. 

The  computational  burden  involved  in  item  1  above  is  required  of  any  tracking 
method  that  uses  particle  filters,  including  myopic  approaches  such  as  CPA.  The 
extent  of  this  burden  depends  on  the  number  of  particles  used  in  the  filter.  Of  course, 
with  additional  assumptions,  the  particle-filtering  approach  can  be  replaced  by  some 
other,  such  as  Kalman  filtering  (but  this  consideration  applies  to  both  our  approach 
and  conventional  approaches). 

Item  2  above,  which  involves  solving  an  optimization  problem,  is  common  to 
both  our  approach  and  myopic  approaches.  The  difference  between  the  two  is  that  the 
objective  function  being  optimized  in  our  approach  is  given  by  the  Q-values,  whereas 
in  myopic  approaches  the  objective  function  is  some  given  (myopic)  criterion.  In  the 
case  of  CPA,  this  function  is  given  by  the  distance  between  sensors  and  the  estimated 
target  position.  The  fact  that  our  approach  involves  solving  an  optimization  problem 
at  every  step,  just  like  in  myopic  approaches,  is  a  desirable  consequence  of  Bellman’s 
optimality equation (see Section 2). In either case, the computational burden required
for this optimization depends on the size of the search space (the number of feasible
actions). 

The third source of computational requirements in our approach is that of
evaluating Q-values via Monte Carlo sampling. This is a computational burden that our
method has to bear, but one that is not present in conventional myopic methods
involving “simple” objective functions (such as CPA). A distinct advantage of Monte
Carlo sampling is that we can incorporate sophisticated objective functions, taking
into account factors that are not possible to account for analytically. Hence,
myopic methods with complicated objective functions that are impossible to evaluate
analytically may also take advantage of Monte Carlo methods. In this case, the
computational burden becomes comparable to that of our method. Similarly, in (rare)
situations where the Q-values in our method can be computed analytically, the
burden of Monte Carlo sampling may be ameliorated. In either case, the computational
requirement in Monte Carlo sampling depends on the length of the simulation runs
and the number of samples needed for averaging. By controlling these quantities,
we can trade off the performance of the resulting policy for reduced computational
complexity.
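
The dependence on run length and sample count can be made concrete with a simple count of simulated state transitions per decision, assuming one simulation run per particle for each candidate action (as in the rollout procedure described earlier). The function name and the example values of 500 particles and a 10-step horizon are hypothetical and are not the settings of our experiments.

```python
def rollout_sim_steps(num_actions, num_samples, horizon):
    """Simulated transitions needed to evaluate all Q-values once."""
    return num_actions * num_samples * horizon


# Four candidate sensors, 500 runs per action, 10-step runs.
steps_full = rollout_sim_steps(4, 500, 10)
# Halving both the sample count and the run length cuts the work by four.
steps_reduced = rollout_sim_steps(4, 250, 5)
```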


5  Simulation  Experiments 

We evaluated our approach via simulation experiments. In our experiments, the
target-motion model is given by equation (3), with $\sigma_x = \sigma_y = g\,T_s^{-1/2}$ ($g$ is the
acceleration of gravity), and there are $M = 4$ sensors available (sensors A, B, C, and D
with $m = 1, 2, 3,$ and $4$), with the observation map (5). The other parameters used in
our experiments are as follows: the sampling interval is $T_s = 2$ sec; the locations
of sensors are $\{(sp_x(1),sp_y(1)), (sp_x(2),sp_y(2)), (sp_x(3),sp_y(3)), (sp_x(4),sp_y(4))\} =
\{(0,0), (-10,30), (0,60), (10,30)\}$ (km, km); the limits of the Doppler blind zone for
all sensors are $B_0 = 100$ km/h; and the probabilities of detection for the sensors are
all equal: $P_d = 0.9$.

We compare the performance of our rollout policy to the commonly used CPA
(Closest Point of Approach) policy. CPA, selecting the closest sensor to the target
estimate, is a “greedy” approach that does not take into account the sensor power
consumption or the sensor error statistics. We consider two scenarios. In scenario 1,
we assume that one of the sensors consumes much more energy than the other three,
and that the error statistics for all the sensor measurements are the same: $\sigma_r = 250$
m, $\sigma_{\dot{r}} = 3$ m/s, and $\sigma_\theta = 1°$. Because our rollout policy takes this information into
account, it can avoid selecting the more costly sensor at appropriate times. Figures 1
and 3(a) illustrate the true trajectories of the target, the estimates of the target
positions, and the sequences of the selected sensors using the CPA policy and the
rollout policy, respectively. Figure 2(a) shows the accumulated tracking error and
power-consumption cost from the CPA and rollout policies. Here, sensor C is the
sensor that consumes much more energy than the other three.
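
For reference, the CPA baseline amounts to a nearest-sensor rule, which can be sketched as below. The sensor coordinates are the ones listed above (in km); the function name `cpa_select` and the example target estimates are hypothetical.

```python
import math

# Sensor positions in km, as given in the experiment description.
SENSORS = {"A": (0.0, 0.0), "B": (-10.0, 30.0),
           "C": (0.0, 60.0), "D": (10.0, 30.0)}


def cpa_select(estimate_xy, sensors=SENSORS):
    """Return the name of the sensor closest to the estimated position."""
    return min(sensors, key=lambda name: math.dist(estimate_xy, sensors[name]))


near_origin = cpa_select((1.0, 5.0))
near_top = cpa_select((0.0, 50.0))
near_left = cpa_select((-8.0, 28.0))
```

The rule uses only geometry, which is why it cannot respond to differences in power consumption or measurement noise among the sensors.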

In scenario 2, we assume that all sensors have the same power consumption but
sensor B has much smaller measurement noise than the others, with its error statistics
being: $\sigma_r = 50$ m, $\sigma_{\dot{r}} = 0.6$ m/s, and $\sigma_\theta = 0.2°$. As shown in Figure 3(b), our rollout
policy tends to select sensor B, which leads to a lower tracking error than that of CPA
(see Figure 2(b)). Here, the CPA policy behaves the same as in scenario 1, since sensor B
is never selected under CPA.

In  our  experiments,  it  is  not  surprising  that  the  rollout  policy  outperforms  CPA. 
What  is  significant  here  is  that  the  rollout  policy  systematically  and  automatically 
trades  off  tracking  performance  and  sensor-usage  costs.  In  scenario  1,  we  sacrifice 
some  tracking  performance  for  large  reductions  in  sensor  usage  costs.  In  scenario  2, 
we  reduce  the  tracking  error  with  no  increase  in  sensor-usage  costs. 


6  Conclusion  and  Future  Work 

In this paper, we formulated the problem of sensor scheduling for target tracking
as a POMDP, and proposed a general approach that combines particle filtering
and Q-value approximation for solving the POMDP. As a particular instance of this
approach, we implemented policy rollout with particle filtering. Our experiments on
a simple sensor-scheduling problem involving multiple sensors for tracking a single
target illustrate the effectiveness of this general approach.

Applying our approach to more complicated sensor-scheduling problems is part of
our ongoing work. Specifically, we are currently investigating sensor-scheduling
problems with multiple targets and the selection of multiple sensors. For such
multiple-target multiple-sensor scenarios, the state dynamics and observation map need to
be extended accordingly. For the particle filter, we have some options: we can either
construct a single particle filter for all targets or construct one particle filter for each
target. The particle-filter algorithm and the Q-value approximation procedure remain
the same, except with higher dimensions. The main additional feature needed in the
multiple-target case is to incorporate a data-association module to decide which
target is associated with each observation. Data association algorithms, such as JPDA
(Joint Probabilistic Data Association), have been studied extensively. Our concern
is simply to incorporate such algorithms into our approach. This turns out to be
straightforward for the case of JPDA. For extensions of our approach involving the
selection of multiple sensors, we also need to do sensor data fusion. This too is an
area with an extensive literature from which to draw.

We can include time-varying and frequency-varying jamming sources in the
state model by representing unobservable jamming intensities as additional state
components. Similarly, we can incorporate measurements of jamming intensities into the
observation model. This will enable us to deal with jamming, but will also impose
additional costs. We expect these additional costs to be proportional to the number
of jamming parameters.

A phenomenon known as “ghosting” is an important issue in multi-sensor arrays
where two or more radar sensors, each limited in range resolution, interrogate an
environment containing two or more targets.¹ The ghosting phenomenon can cause
a multi-sensor array to generate apparent target detections where there is no target.
To deal with ghosting, a number of techniques have been proposed [18]. Typically,
a “deghosting” system consists of an angle-only tracking filter, a triangulation range
estimator, and a hypothesis test to determine if there are ghosts. Our method can
handle ghosting by incorporating a particle filter to estimate true angles, angle
velocities, and target ranges based on angle measurements, and by performing hypothesis
testing to eliminate ghosts.

Another direction for future research is to apply our work to a more general
sensor-management problem, which includes sensor geometry control, sensor
bandwidth allocation, sensor mode switching, as well as sensor scheduling. We may also
consider integrating constraints into our POMDP formulation, such as battery
capacity, load balancing, and bandwidth limits.


References 

[1]  S. Blackman, R. Popoli, Design and Analysis of Modern Tracking Systems, Artech
House, Boston, 1999.

[2]  Y. Oshman, Optimal sensor selection strategy for discrete-time state estimators, IEEE
Transactions on Aerospace and Electronic Systems 30 (2) (1994) 307-314.

[3]  A. Logothetis, A. Isaksson, On sensor scheduling via information theoretic criteria,
in: Proceedings of the American Control Conference, San Diego, CA, USA, 1999, pp.
2402-2406.

[4]  C. Kreucher, K. Kastella, A. O. Hero III, Multi-target sensor management using
alpha-divergence measures, in: Proceedings of First IEEE Conference on Information
Processing in Sensor Networks, Palo Alto, 2003.

[5]  D. A. Castañón, Approximate dynamic programming for sensor management, in:
Proceedings of the IEEE Conference on Decision and Control, San Diego, CA, USA,
1997, pp. 1202-1207.

[6]  A. E. B. Lim, V. Krishnamurthy, Risk-sensitive sensor scheduling for discrete-time
nonlinear systems, in: Proceedings of the IEEE Conference on Decision and Control,
Tampa, FL, USA, 1998, pp. 1859-1864.

[7]  J. Evans, V. Krishnamurthy, Optimal sensor scheduling for hidden Markov model state
estimation, International Journal of Control 74 (18) (2001) 1737-1742.

[8]  V. Krishnamurthy, Algorithms for optimal scheduling of hidden Markov model sensors,
IEEE Transactions on Signal Processing 50 (6) (2002) 1382-1397.

[9]  A. S. Chhetri, D. Morrell, A. Papandreou-Suppappola, Scheduling multiple sensors
using particle filters in target tracking, in: Proceedings of 2003 IEEE Workshop on
Statistical Signal Processing, 2003, pp. 549-552.

1  We  thank  an  anonymous  reviewer  for  raising  this  issue. 


[10]  A. S. Chhetri, D. Morrell, A. Papandreou-Suppappola, The use of particle filtering
with the unscented transform to schedule sensors multiple steps ahead, in: Proceedings
of 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP ’04), Vol. 2, 2004, pp. 301-304.

[11]  O. Hernández-Lerma, Adaptive Markov Control Processes, Springer-Verlag, New York,
1989.

[12]  D. Q. Mayne, H. Michalska, Receding horizon control of nonlinear systems, IEEE
Transactions on Automatic Control 35 (7) (1990) 814-824.

[13]  N. Gordon, B. Ristic, Tracking airborne targets occasionally hidden in the blind Doppler,
Digital Signal Processing 12 (2-3) (2002) 383-393.

[14]  A. Doucet, N. de Freitas, N. Gordon, Sequential Monte Carlo Methods in Practice,
Springer-Verlag, New York, 2001.

[15]  C. Kwok, D. Fox, M. Meila, Real-time particle filters, Proceedings of the IEEE 92 (3)
(2004) 469-484.

[16]  D. P. Bertsekas, D. A. Castañón, Rollout algorithms for stochastic scheduling problems,
Journal of Heuristics 5 (1999) 89-108.

[17]  G. Wu, E. K. P. Chong, R. L. Givan, Burst-level congestion control using hindsight
optimization, IEEE Transactions on Automatic Control 47 (6) (2002) 979-991.

[18]  R. Yang, G. W. Ng, Deghosting in multi-passive acoustic sensors, in: Proceedings of
SPIE - The International Society for Optical Engineering, Multisensor, Multisource
Information Fusion: Architectures, Algorithms, and Applications, Vol. 5434, 2004, pp.
187-194.



Fig. 1. Sensor selection and trajectory of the CPA policy (scenarios 1 and 2).

Fig. 2. Comparison of accumulated tracking errors and accumulated sensor power-consumption
costs for the CPA and rollout policies: (a) scenario 1; (b) scenario 2.

Fig. 3. Sensor selections and trajectories of the rollout policy: (a) scenario 1; (b) scenario 2.

Abstract

The application of tree-based classifiers to automatic target recognition (ATR) and other
classification problems is studied. The described work builds on a previous DARPA effort in which
binary-tree classifiers were applied to ATR with range-Doppler returns as input. The feature vectors
required by the classifier are found during training by applying the local discriminant
basis (LDB) wavelet-based technique. The extension of the LDB binary-tree ATR method to ISP
is described in detail in this document. The fundamental idea is to connect a set of binary-tree
classifiers in such a way that decisions at ambiguous nodes are resolved by requesting the best
new measurement or statistic from the available sensor(s). In this way, the resulting binary
hypertree classifier can have performance comparable to that of the supertree classifier, which is
informed of all sensor measurements, but with a substantially reduced average input-data
requirement, because it requests only the data most relevant to the current point in the decision
process.

1 Introduction

Automatic, accurate determination of a radar target's type finds important application in tactical
military situations. It may also have application in commercial aviation and search-and-rescue
operations. More generally, automatic target recognition (ATR) is a specific example of an automatic
M-ary classification problem [ ]. Such problems can be found in many scientific and technological
areas, such as type classification of RF signals, celestial objects, material compositions, tumors,
heart diseases, bird calls, whale songs, etc.

In this report, we document progress on the development of a family of classification methods
that integrate classification (processing) with measurement (sensing). The basic idea is to construct
a classifier that first operates on a limited initial data set. If the classifier cannot produce a
high-quality (high-confidence) decision, it requests the sensor measurement whose content will most
decisively resolve the classification ambiguity. This iteration continues until a high-quality decision
is reached or no further measurements are available that can help reduce decision ambiguity.

Our specific approach relies on binary-tree classifiers. In particular, we build on the work
we performed under the DARPA TRUMPETS program [ ], in which we developed novel binary-tree
classifiers for range-Doppler inputs. The input-data statistics used to make the decision at
each node are wavelet-based, and the local discriminant basis (LDB) methodology [ ] is adapted
to find the best set of K such statistics independently for each node. This methodology can be
viewed as a form of wavelet-based compression in which only the most discriminant portions of
the data are retained, rather than those that yield the best reconstruction fidelity. In the present
work, we develop an ISP classification framework in which collections of binary trees are joined
together to form binary hypertrees. Each constituent binary tree in the hypertree is associated with
a particular kind of sensor or sensor-target geometry. When an ambiguous node is reached during
tree traversal for classification, the classifier jumps from the current tree to the tree that can best
resolve the ambiguity. This may necessitate obtaining a new sensor output. Thus, the ideal is that
the minimum amount of data is collected for the problem at hand and average performance for the
hypertree is much better than for any single constituent tree.

The remainder of this document is organized as follows. The basic mathematical framework
for ISP-enabled iterative classification (ATR) is presented in Section 2. This framework includes
definitions of several tree-based classifiers. The concepts of operation for the various classifiers
are described in Section 3 and an approximate performance analysis is provided in Sections 4 and
5. Algorithms for classifier construction and operation are provided in Section 6. An illustrative
(toy) problem is examined in Section 7, and the applicability of the approach to a wide variety
of classification problems and classifier structures is described in Section 8. Finally, conclusions
are drawn in Section 9. In a later stage of this work, a companion report will provide detailed
simulation results for the classifiers defined and described herein.


2  Mathematical  Framework 

In this section we present the basic mathematical definitions required to study the proposed
tree-based classifiers. In subsequent sections we introduce several propositions and further definitions.

Figure 1: Graphical depiction of a generic binary tree for tree parameter $C = 8$.


Definition 1 (Classifier Input Data (CID)) The CID quantity $X$ is a vector or matrix that
contains measurements obtained by sensing the environment in some manner and optionally
processing this sensor data. $X$ is then used as input to a classifier. ■

Definition 2 (C-Class Problem) Given class-membership labels $1, 2, \ldots, C$, and at least one
CID $X$, determine the class label $L$ that corresponds to the CID, $L = g(X)$, $L \in \{1, 2, \ldots, C\}$.
The function $g(\cdot)$ represents the classifier. ■

Definition 3 (Binary Tree) For the dyadic number $C$, a binary tree is a collection of $2C - 1$ nodes
with a specific node-connection topology. The root node is node number one and has two children.
The final $C$ nodes are childless. All other nodes have exactly one parent and two children. Each
node is the child of only one parent. Node connections and numbering are as shown in Figure 1.
The tree is said to be balanced if the number of nodes at level $l$ is twice the number of nodes at
level $l - 1$ for each tree level greater than zero. Otherwise, the tree is unbalanced. ■


Definition  4  (Decision  Node)  A  decision  node  in  a  binary  tree  with  parameter  C  is  any  node 
except  one  of  the  final  (childless)  C  nodes  (see  Figure  1).  ■ 

Definition  5  (Terminal  Node)  A  terminal  node  in  a  parameter-C  binary  tree  is  any  node  that  is 
not  a  decision  node.  ■ 

Definition 6 (Binary-Tree Classifier (BTC)) A binary-tree classifier is a binary tree for
parameter $C$ such that each node is associated with a vector-valued measurement on the CID $X$ and a
binary decision between two mutually exclusive groups of class labels, each called a superclass.
The superclasses for node $i$ derive from splitting the inherited class label set from the parent of
node $i$ into two nonempty and disjoint sets. The average measurement vector value for the left
superclass is denoted by $v$ and that for the right by $u$. The length of the measurement vector for all
nodes is denoted by $K$. The BTC definition is illustrated by Figure 2. ■

Figure 2: Graphical depiction of a generic balanced binary tree classifier for tree parameter
(problem size) $C = 8$. (Figure label: "Original 8 classes".)


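Definition 7 (BTC Node Ambiguity) For a BTC decision node $n$, the ambiguity $A_n$ is defined by

$$A_n = \frac{1}{2}\left(r_n + 1\right),$$

where

$$r_n = \frac{\sum_{k=1}^{K} u_{n,k}\, v_{n,k}}{\left[\sum_{k=1}^{K} u_{n,k}^{2}\right]^{1/2} \left[\sum_{k=1}^{K} v_{n,k}^{2}\right]^{1/2}}.$$

Therefore the node ambiguity is a simple function of the correlation coefficient (CC) $r_n$ between
the left and right feature vectors for node $n$. ■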
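
As a small numerical illustration of Definition 7, the sketch below evaluates the ambiguity of a node from its stored left and right feature vectors. It assumes that $r_n$ is the normalized inner product of $u$ and $v$ (no mean removal) and that $A_n = (r_n + 1)/2$; the function names are illustrative and are not taken from the report.

```python
import math

def correlation_coefficient(u, v):
    """Normalized inner product of two equal-length feature vectors (assumed form of r_n)."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length K")
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if den == 0.0:
        raise ValueError("feature vectors must be nonzero")
    return num / den

def node_ambiguity(u, v):
    """Map r_n in [-1, 1] to an ambiguity in [0, 1] via A_n = (r_n + 1) / 2."""
    return 0.5 * (correlation_coefficient(u, v) + 1.0)
```

Under these assumptions, anti-correlated feature vectors give an ambiguity near zero, orthogonal vectors give one half, and nearly identical vectors give an ambiguity near one.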


Definition 8 (Node Descendents) The descendents of binary-tree node $n$ are all nodes that can
be reached from $n$ by traversing the tree starting at $n$. The descendents of the root node (node one)
are all nodes other than node one. The set of descendents for a terminal node is empty. ■

Definition 9 (Average Downbranch Ambiguity) Let BTC node $n$ have $N$ descendents $\{n_i\}_{i=1}^{N}$.
Then the average downbranch ambiguity (ADA) for node $n$ is

$$\alpha_n = \frac{1}{N+1}\left(A_n + \sum_{k=1}^{N} A_{n_k}\right).$$

Therefore the total BTC ambiguity is $\alpha_1$. ■

Definition 10 (Corresponding Nodes) Let $T_1$ and $T_2$ denote two distinct BTCs and let $n_1$ and $n_2$
denote decision nodes from $T_1$ and $T_2$, respectively. If the union of the left and right superclasses
for nodes $n_1$ and $n_2$ match, then these two nodes are corresponding. If the superclasses are equal,
then the nodes are equivalent. Note that the binary-tree parameters $C_1$ and $C_2$ for $T_1$ and $T_2$ need
not be equal. ■

Definition 11 (Binary Hypertree) Let $\{T_i\}_{i=1}^{N}$ denote a set of $N$ binary trees with identical
parameters $C$. Associate with each node $n$ in each tree $i$ an index in the set $\{1, 2, \ldots, N\}$ and a node
index. Then this node $n$ in tree $i$ is said to contain a pointer to the indexed tree. The collection of
binary trees so indexed is called a binary hypertree. ■

Definition 12 (Binary Hypertree Classifier (BHC)) Let $H$ denote a binary hypertree with $N$
constituent binary tree classifiers, each with parameter $C$ and each addressing the same
classification problem. Assign the node pointers for each node such that the node points to the BTC
with the minimum-ambiguity corresponding node. To classify a CID, select an initial constituent BTC.
Use the node pointers to switch constituent trees whenever a node does not point to itself. Upon
switching to a BTC, if the CID for that BTC has not been obtained, direct the appropriate sensor
to obtain the data and then proceed. Figure 3 depicts graphically a BHC for $N = 4$ and $C = 8$.
Hypertree indices for decision nodes 1, 2, and 4 are also shown in the figure. ■

Definition 13 (Binary Supertree Classifier (BSC)) A binary supertree classifier is a BTC for
which the CID is multimodal. That is, the CID is a concatenated or otherwise stacked set of
separate (distinct) CIDs, $Y = \{X_j\}_{j=1}^{M}$. The BSC can be viewed as the binary-tree classifier that
operates on all available sensor outputs. ■

3  Concept  of  Operations 

We  will  provide  detailed  algorithm  statements  for  classifier  construction  and  operation  in  Section  6. 
In  this  section  we  provide  a  high-level  overview  of  the  operation  of  the  three  main  classifier  types — 
BTC,  BSC,  and  BHC — to  provide  sufficient  context  for  the  performance  analysis  of  Sections  4  and 
5. 

Binary  Tree  Classifiers. 

Figure 3: A binary hypertree illustration for an eight-class problem. The hypertree indices for
nodes one, two, and four are shown.

A typical binary tree classifier (BTC) is shown in Figure 2. This kind of classifier is constructed by
obtaining a set of training data for one CID type and all classes and applying the LDB machinery
of [ ]. The resulting classifier must operate only on the CID type with which it was created. To
operate the classifier, we begin in node one, compute the correlation between the measured feature
and the left and right stored features, and take the path out of the node indicated by the larger
correlation. This is continued until a terminal node is reached.
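
The traversal just described can be sketched in a few lines. This is only an illustration under stated assumptions: the tree is balanced and stored in heap order (children of node $i$ are $2i$ and $2i+1$, which may differ from the numbering of Figure 1), the correlation is taken to be a normalized inner product, exact ties are broken by a fair coin, and the names `cc`, `node_decision`, `classify`, and `measure` are hypothetical.

```python
import math
import random

def cc(a, b):
    """Normalized inner product, used here as the correlation coefficient (an assumption)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

def node_decision(y, v, u, rng=random):
    """Return 'left' or 'right' for measured features y given stored averages v (left), u (right)."""
    z_l, z_r = cc(y, v), cc(y, u)
    if z_l > z_r:
        return "left"
    if z_l < z_r:
        return "right"
    return "left" if rng.random() < 0.5 else "right"  # fair coin on an exact tie

def classify(measure, nodes, C):
    """Traverse a balanced tree in heap order and return a class label in 1..C.

    measure(i) returns the K-vector measured on the CID at decision node i;
    nodes[i] = (v_i, u_i) holds the stored left/right average feature vectors.
    Decision nodes are 1..C-1; terminal nodes C..2C-1 map to labels 1..C.
    """
    i = 1
    while i < C:
        v, u = nodes[i]
        i = 2 * i if node_decision(measure(i), v, u) == "left" else 2 * i + 1
    return i - C + 1
```

A caller supplies `measure` so that each node can apply its own measurement to the CID, mirroring the per-node feature selection described in the text.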

Remark 1 If the CID type is not sufficient to render the total tree ambiguity $\alpha_1$ small, then average
classification performance will likely be poor.

Binary  Supertree  Classifiers. 

A  binary  supertree  classifier  looks  much  like  the  BTC  of  Figure  2.  The  difference  is  that  the  input 
is  a  collection  of  CIDs  of  distinct  types.  The  classifier  is  constructed  by  obtaining  a  set  of  training 
data  for  all  CIDs  of  interest  and  all  classes,  and  then  applying  the  same  LDB  machinery  as  used  for 
the  BTC.  If  the  set  of  CID  types  is  exhaustive  for  a  particular  problem  of  interest  (no  more  can  be 
obtained  from  existing  allocated  resources),  then  the  performance  of  the  BSC  sets  an  upper  limit 
on  the  performance  of  the  BTC. 

Remark 2 If the collection of employed CID types is not sufficient to render the total tree
ambiguity $\alpha_1$ small, then BSC performance will likely be poor. This means that the designated CIDs are
inadequate for the problem at hand.

Binary  Hypertree  Classifiers. 

A  binary  hypertree  classifier  is  shown  in  Figure  3  for  C  =  8  and  N  =  4.  This  kind  of  classifier  is 
constructed  by  obtaining  the  same  training  data  as  used  in  the  BSC,  that  is,  M  sets  of  CID  data  for 
each  of  the  C  classes,  and  then  constructing  at  least  one  BTC  for  each  of  the  M  CID  types.  The 
BTCs  are  linked  together  as  in  Definition  12.  To  operate  the  classifier,  choose  an  initial  constituent 
BTC  and  obtain  its  CID.  Traverse  the  BTC,  jumping  to  another  BTC  whenever  a  decision  node  is 
encountered  that  does  not  point  to  itself.  For  every  jump  to  another  BTC,  obtain  the  corresponding 
CID  if  it  has  not  already  been  obtained  during  the  traversal  of  the  hypertree. 

Remark  3  The  benefit  of  using  a  BHC  is  that  BSC  performance  can  in  principle  be  achieved 
without  the  cost  of  routinely  obtaining  all  M  CID  types  for  each  traversal  of  the  tree. 


4  Analysis  for  Balanced  Trees 

In this section we present some analysis results for the specific case of balanced trees (see
Definition 3). Unbalanced trees are addressed in Section 5.

Each decision node in a BTC or BHC must make a binary decision based on a vector of $K$
(wavelet) measurements on the CID. This decision is made by computing the correlation
coefficients (CCs) between the measured feature vector $y$ and the left and right superclass average
feature vectors $v$ and $u$,

$$z_l = CC(y, v), \qquad z_r = CC(y, u).$$

If $z_l > z_r$, take the left decision path out of the node; if $z_l < z_r$, take the right path; otherwise
flip a fair coin to determine which path to take.

If the CC between $v$ and $u$ is large and negative, the ambiguity will be close to zero. In this
case, if the CID corresponds to a class in $\{1, 2, \ldots, C\}$ (i.e., it is represented by the tree, not an
unknown class), then we expect the correct decision to be made at this node almost all of the time.

On the other hand, if the CC is large and positive, the ambiguity is close to one, and we
expect the decision to be incorrect about half the time on average. Let us represent the node error
probability as a function of the node ambiguity,

$$P_e(n) = f(A_n),$$

such that $f(0) = 0$ and $f(1) = 1/2$ and $f(\cdot)$ is continuous and monotonic on $[0, 1]$, bounded below
by zero, and bounded above by one. An example is $f(x) = x/2$, for which we have $P_e(n) = A_n/2$.

4.1 Binary Tree Classifier

We now compute the BTC probability of correct classification for class $c \in \{1, 2, \ldots, C\}$. There
is only one path through a BTC that terminates at class $c$. Let the node sequence that defines this
path be denoted by $\{n_{c,1}, n_{c,2}, \ldots, n_{c,D}\}$, where $D = \log_2(C)$. This distinct path for class $c$ is
associated with the sequence of ambiguities $\{A_{n_{c,1}}, A_{n_{c,2}}, \ldots, A_{n_{c,D}}\}$.

Proposition 1 (Error Performance for a BTC)

Given a $C$-class problem and a BTC associated with that problem, assume that the decisions at
each node in the path for class $c$ are approximately independent. Then the average classification
error for the BTC is given by

$$P_e(\mathrm{BTC}) = \frac{1}{C}\sum_{c=1}^{C}\left[1 - \prod_{k=1}^{D}\left(1 - f(A_{n_{c,k}})\right)\right]. \qquad (1)$$

Proof. The probability of correct classification, conditioned on the true input class of $c$, is the
product of marginal probabilities of correct classification for each node in the path,

$$P(L = c \mid c) = \prod_{k=1}^{D} P_s(n_{c,k}) = \prod_{k=1}^{D}\left(1 - P_e(n_{c,k})\right) = \prod_{k=1}^{D}\left(1 - f(A_{n_{c,k}})\right).$$

Thus, the conditional probability of error is given by

$$P_e(c) = P(L \neq c \mid c) = 1 - P(L = c \mid c) = 1 - \prod_{k=1}^{D}\left(1 - f(A_{n_{c,k}})\right).$$

If we assume that the prior probabilities for the $C$ classes are equal, then the total probability of
error is given by

$$P_e(\mathrm{BTC}) = \frac{1}{C}\sum_{c=1}^{C} P(L \neq c \mid c) = \frac{1}{C}\sum_{c=1}^{C}\left[1 - \prod_{k=1}^{D}\left(1 - f(A_{n_{c,k}})\right)\right].$$

□


Note  that  the  probability  of  error  for  the  BTC  is  completely  characterized  by  the  node  ambigui¬ 
ties.  The  limiting  cases  of  completely  ambiguous  and  unambiguous  BTCs  are  treated  in  the  next 
proposition. 

Proposition  2  (BTC  Error  Probability  Limiting  Cases) 

Given a BTC with parameter $C$, if all node ambiguities are zero, then the probability of error
for the BTC is zero. If all node ambiguities are equal to one, then the probability of error is equal
to $(C - 1)/C$. ■


For  a  BTC,  the  location  in  the  tree  of  highly  ambiguous  nodes  has  a  strong  influence  on  perfor¬ 
mance.  This  is  easy  to  see  since  the  various  decision  nodes  are  components  of  different  numbers  of 
paths.  If  the  root  node  is  highly  ambiguous  with  A1  =  1,  then  all  paths  begin  with  an  ambiguous 
decision  and  the  probability  of  error  for  the  tree  is  1/2.  The  location  of  the  ambiguous  node  is 
grossly  characterized  by  its  level  in  the  tree  (see  Figure  1). 

Proposition 3 (BTC Error as Function of Ambiguous Node Position)

Suppose we have a BTC with parameter $C$. Node $n$ at tree level $l$ has ambiguity $A_n = 1$ and all
other decision nodes have zero ambiguity. Then the probability of error for the BTC is $2^{-(l+1)}$. ■
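
The closed-form results of Propositions 1 through 3 are easy to check numerically. The sketch below evaluates Equation (1) for a balanced tree, assuming heap-order node numbering (root is node 1 at level 0, children of node $i$ are $2i$ and $2i+1$), $C$ a power of two, and the example error map $f(x) = x/2$; the function names are illustrative only.

```python
def f_linear(a):
    """Example error map from the text: f(x) = x / 2."""
    return a / 2.0

def path_nodes(c, C):
    """Decision nodes on the root-to-leaf path for class c (1-based), heap-ordered tree."""
    i = C + c - 1          # terminal node holding class c
    path = []
    while i > 1:
        i //= 2            # move to the parent
        path.append(i)
    return path[::-1]      # root first

def btc_error(ambiguity, C, f=f_linear):
    """Equation (1): average error over C equiprobable classes.

    ambiguity maps each decision-node index (1..C-1) to A_n in [0, 1].
    """
    total = 0.0
    for c in range(1, C + 1):
        p_correct = 1.0
        for n in path_nodes(c, C):
            p_correct *= 1.0 - f(ambiguity[n])
        total += 1.0 - p_correct
    return total / C
```

With $C = 8$, setting every ambiguity to one gives $7/8$, and a single fully ambiguous node at level $l$ gives $2^{-(l+1)}$, which is why ambiguity near the root is the most costly.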

The  average  error  can  be  minimized  by  ensuring  that  high-level  nodes  (small  l )  have  minimum 
ambiguity.  One  way  of  doing  this  is  by  selecting  the  decision-node  superclasses  to  push  ambiguity 
downward  in  the  tree.  Other  methods  include  extending  the  BTC  to  a  BHC. 

4.2 Binary Hypertree Classifier

For BHCs, each node in each of the constituent BTCs contains a pointer to the constituent BTC
with the minimum-ambiguity corresponding node. Therefore, the ambiguity $A_{n_{c,k}}$ seen at node $n_{c,k}$
for the path corresponding to class $c$ in a BTC is replaced by

$$B_{n_{c,k}} = \min_{t} A^{(t)}_{n_{c,k}},$$

where $t$ indexes the constituent BTCs with nodes corresponding to $n_{c,k}$.

Proposition 4 (Error Performance for a BHC)

The average error probability for a BHC is given by

$$P_e(\mathrm{BHC}) = \frac{1}{C}\sum_{c=1}^{C}\left[1 - \prod_{k=1}^{D}\left(1 - f(B_{n_{c,k}})\right)\right].$$

The error performance for a BHC is completely characterized by the minimum-ambiguity
corresponding nodes.
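
As a rough illustration of how the minimum-ambiguity pointers determine BHC performance, the sketch below links several constituent trees and evaluates the resulting error. It assumes that all constituent BTCs are balanced, use heap-order numbering, and share the same superclass splits, so that node $n$ in every tree is a corresponding node; it also assumes $f(x) = x/2$. The names `link_hypertree` and `tree_error` are hypothetical.

```python
def f_linear(a):
    """Example error map from the text: f(x) = x / 2."""
    return a / 2.0

def tree_error(ambiguity, C, f=f_linear):
    """Average error of a balanced, heap-ordered tree with the given node ambiguities."""
    total = 0.0
    for leaf in range(C, 2 * C):       # terminal nodes C..2C-1, one per class
        p_correct, n = 1.0, leaf // 2
        while n >= 1:                  # walk from the leaf's parent up to the root
            p_correct *= 1.0 - f(ambiguity[n])
            n //= 2
        total += 1.0 - p_correct
    return total / C

def link_hypertree(trees):
    """Return (B, pointer): the minimum ambiguity per node and the tree index attaining it.

    trees is a list of dicts mapping decision-node index to ambiguity;
    ties are resolved in favor of the lowest tree index (an arbitrary choice).
    """
    B, pointer = {}, {}
    for n in trees[0]:
        best = min(range(len(trees)), key=lambda t: trees[t][n])
        pointer[n] = best
        B[n] = trees[best][n]
    return B, pointer
```

Because each $B_n$ is no larger than the ambiguity of node $n$ in any single tree, a nondecreasing $f$ makes the BHC error computed this way no larger than that of any constituent BTC.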

Error Bounds.

Let us now compute some bounds on the performance of a BHC. The general expression for the
probability of error for the BHC is

$$P_e(\mathrm{BHC}) = \frac{1}{C}\sum_{c=1}^{C} P_e(c),$$

where

$$P_e(c) = 1 - \prod_{k=1}^{D}\left(1 - f(B_{n_{c,k}})\right).$$

Now $0 \le f(\cdot) \le 1$ so that

$$0 \le 1 - f(B_{n_{c,k}}) \le 1$$

or

$$\prod_{k=1}^{D}\left(1 - f(B_{n_{c,k}})\right) \le 1 - f(\beta_{n_c}),$$

where

$$\beta_{n_c} = \arg\max_{B \in \{B_{n_{c,k}}\}_{k=1}^{D}} f(B).$$

Therefore

$$P_e(c) = 1 - \prod_{k=1}^{D}\left(1 - f(B_{n_{c,k}})\right) \ge 1 - \left(1 - f(\beta_{n_c})\right).$$

Finally, then, the bound on the total error probability is given by

$$P_e(\mathrm{BHC}) \ge \frac{1}{C}\sum_{c=1}^{C}\left[1 - \left(1 - f(\beta_{n_c})\right)\right] = \frac{1}{C}\sum_{c=1}^{C} f(\beta_{n_c}) \ge \frac{1}{C} f(\beta'_n),$$

where $\beta'_n = \arg\max_{c} f(\beta_{n_c})$ and the last step retains only the largest term of the
nonnegative sum. So the average BHC error probability is bounded away from zero
by the node with the largest minimized error probability over all constituent BTCs in the BHC.
When the error function $f(\cdot)$ is a nondecreasing function of the ambiguity, then $\beta'_n = \max_c \beta_{n_c}$.
That is, the maximum error probability corresponds to the maximum ambiguity.

Remark 4 The bound on error performance for the BHC suggests that to improve performance,
find the most ambiguous node in the BHC paths and attempt to find a new CID, a new vector-valued
measurement on an existing CID, or new superclass definitions so as to result in a smaller
ambiguity for this node. The minimum change to the BHC is then to have it point corresponding
nodes to the new BTC.

4.3 Binary Supertree Classifier

Suppose we have $M$ distinct CID types available for use by our classification system. These
could correspond to different radar waveforms (pulse widths), frequency bands, look directions,
or sensor modality, such as optical, infrared, SAR, etc. A sample set of CIDs can be represented
by the collection $\{X_j\}_{j=1}^{M}$, where each $X_j$ is a vector or matrix of data obtained from a particular
active or passive sensing of the environment. A binary supertree classifier (BSC) is a binary tree
classifier that takes as input this set of $M$ CIDs (see Definition 13). Since the BSC is a BTC, its
performance is characterized by Proposition 1.

The idea behind the BSC is that of a maximally informed classifier. Its performance serves
as an upper limit on the performance of any other tree-based classifier since it uses all available
information in training and in operation.

Of course, the drawback of a BSC is that it might require an enormous amount of input data if
the number of distinct CID types is large and the classification problem is difficult (different classes
require distinct kinds of CIDs to achieve good performance). What we would like is the
performance of the potentially impractical BSC with the relatively modest operational requirements of
the BHC.

4.4 Relations Between Classifiers

In this section, we present analysis results pertaining to the relationships between the three
tree-based classifier types. One of our aims is to determine the requirements on a BHC for exact
correspondence between it and a BSC. Another aim is to determine error-performance requirements for
the constituent BTCs of a BHC for good BHC performance: How good do the BTCs need to be to
guarantee good BHC performance?

4.4.1  Relations  Between  BSCs  and  BHCs 

From an intuitive point of view, the error probability of a BSC should lower-bound that of
any BHC for the same problem because the BSC has all available CID types and can process them
jointly at each decision node.

Let us first consider the class of BSCs for which the measurements made at each decision node
require only one CID type; that is, each $K$-vector of measurements at node $n$ requires only one
element of the multimodal CID $Y = \{X_j\}_{j=1}^{M}$. For this class of BSCs, we expect that the BSC
performance can be exactly achieved by a relatively simple BHC. These considerations lead to the
following two definitions.




Definition 14 (Reducible Binary Supertree Classifier) Let the BSC $S$ correspond to the
multimodal CID $Y = \{X_j\}_{j=1}^{M}$. $S$ is reducible if the measurement vector for each decision node $n$ is
associated with only one element of $Y$. Otherwise the BSC is irreducible. ■

Definition 15 (Simple Binary Hypertree Classifier) Let the multimodal CID be represented by
$Y = \{X_j\}_{j=1}^{M}$. An associated BHC is called simple if all constituent BTCs have CIDs that
correspond to a single element of $Y$. Otherwise, if at least one BTC has a multimodal CID, the BHC is
called complex. ■

Let us also formalize the notation for describing the required CID elements at a decision node.

Definition 16 (Mode Subset) Let $n$ represent a decision node of an irreducible BSC that is
associated with the multimodal CID set $Y = \{X_j\}_{j=1}^{M}$. The $K$-vector of measurements associated
with $n$ requires at least one element of $Y$ and at most all $M$ elements. Let $I$ represent the vector of
indices in $\{1, 2, \ldots, M\}$ that are actually required by node $n$. $I$ is called the mode subset vector. ■


The  performance  of  any  reducible  BSC  can  be  achieved  by  a  simple  BHC,  which  is  the  subject 
of  the  following  proposition. 

Proposition 5 (Equivalence Between Reducible BSC and Simple BHC)

Let $S$ represent a reducible BSC associated with the $C$-class problem $\mathcal{P}$ and the multimodal CID
set $Y = \{X_j\}_{j=1}^{M}$. Then there exists a simple BHC $H$ for $\mathcal{P}$ associated with $Y$ such that
$P_e(S) = P_e(H)$. Moreover, there are $M$ BTCs in $H$. ■

Proof. (By construction.) For each $j = 1, 2, \ldots, M$, construct a BTC $T_j$ for $\{X_j\}$ such that the
superclasses for each node $n$ in each BTC match those in $S$ for node $n$. Therefore each decision
node in the BSC has $M$ corresponding nodes in the set of $M$ BTCs. Consider decision node $m$ in
$S$. Since $S$ is reducible, the feature vector for this node can be obtained from measurements on
one of the CID types, say $X_k$. For BTC $T_k$, associate the measurement specification, $u$, and $v$ for
node $m$ in $S$ with node $m$ in $T_k$. For all $T_j$, point node $m$ to BTC $T_k$. Repeat this procedure for all
decision nodes in $S$. The resulting set of $M$ linked BTCs forms a BHC $H$. By construction, each
possible traversal of $S$ corresponds to an identical traversal of $H$, and there are no other traversals
in either classifier. Therefore, $P_e(S) = P_e(H)$. □
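
The construction in the proof, together with the on-demand data acquisition of Definition 12, can be sketched as follows. This is a simplified illustration under several assumptions: the tree is balanced with heap-order numbering, every constituent tree shares one pointer table (because all trees point node $m$ to the same tree), and the node's correlation test is abstracted into a caller-supplied `decide` function. The names `build_pointers`, `traverse`, `decide`, and `acquire` are hypothetical.

```python
def build_pointers(node_mode):
    """Pointer table for a simple BHC built from a reducible BSC.

    node_mode[m] is the index k of the single CID type used by decision node m.
    Every constituent tree points its node m at tree k, so one table is shared.
    """
    return dict(node_mode)

def traverse(pointer, decide, acquire, C, start_tree=0):
    """Traverse the BHC, acquiring each CID type at most once.

    decide(m, k, cid) returns 'left' or 'right' and stands in for the node test;
    acquire(k) obtains the CID of type k from the appropriate sensor.
    Returns the class label (1..C) and the sorted list of CID types actually used.
    """
    cids = {start_tree: acquire(start_tree)}   # the initial tree's CID is obtained first
    m = 1
    while m < C:                               # decision nodes are 1..C-1
        k = pointer[m]
        if k not in cids:                      # jump to tree k: fetch its CID on demand
            cids[k] = acquire(k)
        m = 2 * m if decide(m, k, cids[k]) == "left" else 2 * m + 1
    return m - C + 1, sorted(cids)
```

In this sketch a traversal that stays among nodes served by the initial tree never requests another CID, which is the operational saving that Remark 3 attributes to the BHC.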

Even when a BSC employs multimodal measurements at some or all decision nodes, its
performance can still be obtained by using a properly designed BHC. If most or all of the decision nodes
in the BSC use measurements that require most or all of the elements of $Y$, then there may be no
operational benefit to using the equivalent BHC. But when only a few nodes require multimodal
measurements, there can be great operational and training advantages to using the BHC over the
BSC.

The  relationship  between  irreducible  BSCs  and  complex  BHCs  is  summarized  in  the  following 
proposition. 

Proposition 6 (Equivalence Between Irreducible BSC and Complex BHC)

Let $S$ represent an irreducible BSC associated with the multimodal CID set $Y = \{X_j\}_{j=1}^{M}$ and
a $C$-class problem $\mathcal{P}$. Then there exists a complex BHC $H$ associated with $Y$ and $\mathcal{P}$ such that
$P_e(S) = P_e(H)$. ■

Proof. (By construction.) First consider the decision nodes in $S$ that have mode subset vectors
with length one, say $\{q_j\}_{j=1}^{Q}$. For each $q_j$, construct a BTC $Z_j$ such that its node $q_j$ performs the
same operations as $S$ on the CID associated with $q_j$; point node $q_j$ in $Z_j$ to $Z_j$. For the remaining
$Q - 1$ BTCs, point their $q_j$ nodes to $Z_j$. Now consider the decision nodes in $S$ for which the mode
subset vector has length greater than one, say, $\{p_j\}_{j=1}^{P}$. For each of these nodes, construct a new
BTC in the following way. For node $p_j$, define a multimodal CID $Y_j = \{X_k\}_{k \in I_j}$, where $I_j$ is
the mode subset vector. Let $Z'_j$ denote the BTC associated with $p_j$. The CID for $Z'_j$ is $Y_j$, and
for node $p_j$ in $Z'_j$, perform the same operations on $Y_j$ as in $S$. Point the $p_j$ nodes in all $P$ new
BTCs to $Z'_j$. If $Q > 0$, then examine new BTC $Z_1$ to determine how to point the single-mode
nodes in the $P$ new BTCs. Similarly, for the $Q$ new single-mode BTCs, examine $Z'_1$ to determine
how to point the multimodal nodes. The resulting $P + Q$ BTCs define a complex BHC $H$ since
$P \ge 1$. By construction, each possible path through $S$ has an identical path through $H$. Therefore,
$P_e(S) = P_e(H)$. □

Remark 5 For BSCs that are only mildly irreducible (meaning that $P \ll Q$), the number of
constituent BTCs in the BHC will be only slightly larger than $M$, yet the performance will be
equal to that of the BSC.

4.4.2  Approximation  of  BHCs  by  BHCs 

We have demonstrated that any BSC can be exactly represented by a BHC. Since the
computational and storage burden of the BHC is directly related to the number of constituent BTCs and
their CIDs (multimodal or single-mode), we are now interested in the possibility of approximating
complicated BHCs with simpler ones. So we investigate the performance penalty in approximating
one BHC by another.

Incremental  Approximation. 

Let us first consider the smallest possible difference between two BHCs: a difference in a single
decision node. For example, suppose that in BHC $H_1$, node $n$ obtains its minimum ambiguity for
constituent BTC $k$, which has a multimodal CID. Let the dimension of the mode subset vector for
node $n$ in BTC $k$ be $\nu > 1$. Consider now another BTC with the same superclasses as $k$ but for
which the CID has dimension $\nu - 1$. Form a new BHC, $H_2$, for which all corresponding nodes
with indices $n$ point to this new, lower-dimensional BTC. All other corresponding nodes (those
with indices other than $n$) in all trees of $H_2$ take the same values as their counterparts in $H_1$. Thus,
the only difference between $H_1$ and $H_2$ occurs for decision node $n$.

Proposition  7  (Incremental  BHC  Error  Performance) 

Let $H_1$ and $H_2$ be BHCs for a given C-class problem and multimodal CID $Y = \{X_j\}_{j=1}^{M}$. Let
$H_1$ and $H_2$ differ only in a single decision node n for which the minimum ambiguity over all
corresponding nodes with index n is larger in $H_2$ than in $H_1$. Then the performance difference for
the two BHCs is given by

\[ P_e(H_2) - P_e(H_1) \approx \frac{f(B'_n) - f(B_n)}{2^{k_0 - 1}}, \]

where $B'_n$ and $B_n$ are the minimum ambiguities for $H_2$ and $H_1$, respectively, at node n, and $k_0 - 1$
denotes the tree level of node n. ■

Proof. Recall that the performance of a BHC is given by

\[ P_e(H) = \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D} \left( 1 - f(B_{n_{c,k}}) \right) \right], \]

where $D = \log_2(C)$ and $B_{n_{c,k}}$ is the minimum ambiguity over all constituent BTCs for node $n_{c,k}$,
which is the kth node encountered along the unique path to class c. The performance difference is
computed straightforwardly as follows,

\[ P_e(H_2) - P_e(H_1) = \frac{1}{C} \sum_{c=1}^{C} \left[ \prod_{k=1}^{D} \left( 1 - f(B_{n_{c,k}}) \right) - \prod_{k=1}^{D} \left( 1 - f(B'_{n_{c,k}}) \right) \right], \]

where the primed ambiguities are those of $H_2$. Since the node in question is at level $k_0 - 1$, it is
part of exactly $C/2^{k_0 - 1}$ paths. For all other paths through the BHC, the difference between $H_1$
and $H_2$ is zero. Therefore, the performance difference is given by

\[ P_e(H_2) - P_e(H_1) = \frac{1}{C} \sum_{c \in U} \left( f(B'_n) - f(B_n) \right) \prod_{k \ne k_0} \left( 1 - f(B_{n_{c,k}}) \right), \]

where U denotes the set of class indices for which the unique path from root node to terminal node
includes node n. If the BHC $H_1$ is sufficiently well constructed and the CID is sufficient for good
classification performance, then we make the approximation

\[ \prod_{k \ne k_0} \left( 1 - f(B_{n_{c,k}}) \right) \approx 1, \]

which implies that

\[ P_e(H_2) - P_e(H_1) \approx \frac{f(B'_n) - f(B_n)}{2^{k_0 - 1}}. \]

□


Remark 6 Proposition 7 implies that low-dimensional node approximations are better suited, in
general, to parts of the tree that are nearer the terminal nodes (larger $k_0$).
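The size of the approximation error in Proposition 7 can be checked numerically. The following is a minimal sketch, not part of the original analysis. It assumes a hypothetical ambiguity-to-error map f(a) = a - a^2/2 (chosen only because it is monotone with f(0) = 0 and f(1) = 1/2), heap-style node numbering in which the children of node n are 2n and 2n + 1, and an illustrative eight-class tree whose names (`B1`, `B2`, `bhc_error`) are introduced here.

```python
def f(a):
    """Hypothetical ambiguity-to-error map (an assumption, not from the report)."""
    return a - 0.5 * a * a


def path_nodes(c, D):
    """Decision-node indices on the path to class c (1-based), assuming
    heap-style numbering: the root is 1 and the children of n are 2n, 2n + 1."""
    n = (2 ** D - 1) + c  # terminal node index for class c
    nodes = []
    while n > 1:
        n //= 2
        nodes.append(n)
    return nodes[::-1]


def bhc_error(B, C):
    """Exact P_e(H) from the product formula; B maps node index -> min ambiguity."""
    D = C.bit_length() - 1  # log2(C) for C a power of two
    total = 0.0
    for c in range(1, C + 1):
        p_correct = 1.0
        for n in path_nodes(c, D):
            p_correct *= 1.0 - f(B[n])
        total += 1.0 - p_correct
    return total / C


def incremental_estimate(b_new, b_old, k0):
    """Proposition 7 estimate for a change at a node on tree level k0 - 1."""
    return (f(b_new) - f(b_old)) / 2 ** (k0 - 1)


C = 8
B1 = {n: 0.05 for n in range(1, C)}  # H1: every decision node mildly ambiguous
B2 = dict(B1)
B2[4] = 0.30                         # H2: node 4 (level 2, so k0 = 3) degraded
exact = bhc_error(B2, C) - bhc_error(B1, C)
approx = incremental_estimate(0.30, 0.05, k0=3)
print(f"exact difference = {exact:.5f}")
print(f"Prop. 7 estimate = {approx:.5f}")
```

In this example the estimate exceeds the exact difference by the factor dropped in the proof, namely the product of (1 - f) over the other two nodes on each affected path.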

Now let’s look at a more general approximation of a BHC by a simpler, lower-dimensional
BHC. For BHC $H_1$, the minimum decision-node ambiguities are denoted by $B_{n_{c,k}}$, and for BHC
$H_2$, which approximates $H_1$, the ambiguities are denoted by $B'_{n_{c,k}}$.

Proposition 8 (Approximation of a BHC)

Let the BHCs $H_1$ and $H_2$ correspond to the same C-class problem and multimodal CID
$Y = \{X_j\}_{j=1}^{M}$. $H_2$ approximates $H_1$ by employing lower-dimensional CIDs at one or more nodes, or
by using a smaller value of K at one or more nodes. Assume that the minimum decision-node
ambiguities for $H_1$ lower bound those for $H_2$: $B'_{n_{c,k}} \ge B_{n_{c,k}}$. Then

\[ P_e(H_2) - P_e(H_1) \approx \frac{1}{C} \sum_{c=1}^{C} \sum_{l=1}^{D} \left( f(B'_{n_{c,l}}) - f(B_{n_{c,l}}) \right) \prod_{k=1, k \ne l}^{D} \left( 1 - f(B_{n_{c,k}}) \right). \]

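Proof. From the proof of Proposition 7, the performance difference between $H_2$ and $H_1$ can be
expressed as

\[ P_e(H_2) - P_e(H_1) = \frac{1}{C} \sum_{c=1}^{C} \left[ \prod_{k=1}^{D} \left( 1 - f(B_{n_{c,k}}) \right) - \prod_{k=1}^{D} \left( 1 - f(B'_{n_{c,k}}) \right) \right]. \]

By the monotonicity of $f(\cdot)$, we have

\[ f(B'_{n_{c,k}}) \ge f(B_{n_{c,k}}) \quad \forall c, k, \]

that is, the error probabilities for $H_2$ are no smaller than those for $H_1$. It is convenient to represent
the errors for $H_2$ in terms of those for $H_1$ plus an additional term,

\[ f(B'_{n_{c,k}}) = f(B_{n_{c,k}}) + \epsilon_{n_{c,k}}, \]

where $\epsilon_{n_{c,k}} \ge 0$. Then the error difference can be represented by

\[ P_e(H_2) - P_e(H_1) = \frac{1}{C} \sum_{c=1}^{C} \left[ \prod_{k=1}^{D} \left( 1 - f(B_{n_{c,k}}) \right) - \prod_{k=1}^{D} \left( 1 - f(B_{n_{c,k}}) - \epsilon_{n_{c,k}} \right) \right]. \]

By computing the product involving $\epsilon_{n_{c,k}}$ and retaining only terms that are independent of $\epsilon$ or are
linear in $\epsilon$, we obtain

\[ P_e(H_2) - P_e(H_1) \approx \frac{1}{C} \sum_{c=1}^{C} \left( \sum_{l=1}^{D} \epsilon_{n_{c,l}} \prod_{k \ne l} \left( 1 - f(B_{n_{c,k}}) \right) \right), \]

which is the desired result. □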
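The linearization used in the last step can be written out explicitly. As a sketch, fix a class c and use the shorthand $f_k = f(B_{n_{c,k}})$ and $\epsilon_k = \epsilon_{n_{c,k}}$ (shorthand introduced here only for brevity). Expanding the product and collecting terms by order in $\epsilon$ gives:

```latex
\prod_{k=1}^{D} \left( 1 - f_k - \epsilon_k \right)
  = \prod_{k=1}^{D} \left( 1 - f_k \right)
  - \sum_{l=1}^{D} \epsilon_l \prod_{k \ne l} \left( 1 - f_k \right)
  + O(\epsilon^2),
\qquad\text{so}\qquad
\prod_{k=1}^{D} \left( 1 - f_k \right) - \prod_{k=1}^{D} \left( 1 - f_k - \epsilon_k \right)
  \approx \sum_{l=1}^{D} \epsilon_l \prod_{k \ne l} \left( 1 - f_k \right).
```

For D = 2 the neglected remainder is exactly $\epsilon_1 \epsilon_2$, which illustrates why the approximation is good only when the individual ambiguity increases are small.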

4.4.3  Relations  Between  BHC  and  its  Constituent  BTCs 

In  this  section,  we  consider  the  requirements  on  the  constituent  BTCs  of  a  BHC  for  good  BHC 
performance.  Since  the  BHC  performance  depends  only  on  the  minimum-ambiguity  corresponding 
decision  nodes,  it  appears  that  excellent  BHC  performance  may  be  had  as  long  as  there  is  a  BTC 
with  low  ambiguity  for  each  decision  node.  The  BTC  nodes  that  do  not  achieve  the  minimum 
ambiguity  are  irrelevant  to  BHC  performance  and  can  be  highly  ambiguous.  These  considerations 
suggest  the  definition  of  a  peculiar  kind  of  BTC  that  is  good  at  making  only  a  single  decision. 

Definition 17 (Savant Binary Tree Classifier) A BTC for a C-class problem is $\epsilon$-savant if one of
its decision nodes has ambiguity less than or equal to $\epsilon$ and all other decision nodes have ambiguity
greater than or equal to $1 - \epsilon$. ■

The performance for a savant BTC depends strongly on the position of the unambiguous node
in the tree, as shown in the following proposition.

Proposition  9  (Performance  for  a  Savant  BTC) 

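Suppose we have a savant BTC T with parameter $\epsilon$ and the unambiguous node is at tree level
$k_0 \in \{0, 1, \ldots, D - 1\}$. Then the performance for T is given by

\[ P_e(T) \approx \frac{1}{C \, 2^{k_0}} \left[ 2^{k_0}(C - 1) - 1 + 2\epsilon \right], \]

provided that $f(\epsilon) \approx \epsilon$ and $f(1 - \epsilon) \approx 1/2$. ■

Proof. The low-ambiguity node for T must be part of exactly $C/2^{k_0}$ paths through the tree. From
(1), we can approximate the performance for T as follows

\[ P_e(T) \approx \frac{1}{C} \left[ \sum_{c \in U_1} \left( 1 - \prod_{k=1}^{D} \left( 1 - f(1 - \epsilon) \right) \right) + \sum_{c \in U_2} \left( 1 - \left( 1 - f(\epsilon) \right) \prod_{k \ne k_0 + 1} \left( 1 - f(1 - \epsilon) \right) \right) \right], \]

where $|U_1| = C - C/2^{k_0}$ and $|U_2| = C/2^{k_0}$. For small $\epsilon$, $f(1 - \epsilon) \approx 1/2$ and $f(\epsilon) \approx \epsilon$ so that

\[ P_e(T) \approx \frac{1}{C} \left[ \left( C - C/2^{k_0} \right) \left( 1 - (1/2)^{D} \right) + \left( C/2^{k_0} \right) \left( 1 - (1 - \epsilon)(1/2)^{D-1} \right) \right]. \]

After further simplification, using $(1/2)^D = 1/C$, we arrive at

\[ P_e(T) \approx \frac{1}{C \, 2^{k_0}} \left[ 2^{k_0}(C - 1) - 1 + 2\epsilon \right]. \]

□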


Remark 7 The limiting cases for the performance of a savant BTC correspond to $k_0 = 0$ and
$k_0 = D - 1$. For the former, the small-ambiguity node is the root node and we have $P_e \approx (C - 2)/C$,
which is slightly better than the worst case in which all nodes have high ambiguity. For the latter
case, the small-ambiguity node is just above the layer of terminal nodes and $P_e \approx (C - 1)/C - 2/C^2$,
which is, again, slightly better than the worst case.
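As a numerical illustration of Proposition 9 and Remark 7, the sketch below evaluates the closed form and compares it with the two-group sum from the proof, under the same idealization f(ε) = ε and f(1 - ε) = 1/2. The function names and the choice C = 8, ε = 0.01 are illustrative assumptions.

```python
def savant_error_closed_form(C, k0, eps):
    """Closed form from Proposition 9."""
    return (2 ** k0 * (C - 1) - 1 + 2 * eps) / (C * 2 ** k0)


def savant_error_two_groups(C, k0, eps):
    """Two-group sum from the proof, with f(eps) = eps and f(1 - eps) = 1/2."""
    D = C.bit_length() - 1                  # log2(C) for C a power of two
    n_through = C // 2 ** k0                # |U2|: paths through the good node
    n_other = C - n_through                 # |U1|: all remaining paths
    err_other = 1 - 0.5 ** D
    err_through = 1 - (1 - eps) * 0.5 ** (D - 1)
    return (n_other * err_other + n_through * err_through) / C


C, eps = 8, 0.01
D = C.bit_length() - 1
for k0 in range(D):
    print(k0, round(savant_error_closed_form(C, k0, eps), 4))
print("worst case (C - 1)/C =", (C - 1) / C)
```

Every level gives an error probability between roughly (C - 2)/C and (C - 1)/C, so a single good node never rescues the tree on its own.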

Now  if  a  BHC  is  made  up  of  savant  BTCs  and  every  decision  node  in  the  BHC  is  covered  by 
the  unambiguous  node  for  at  least  one  of  the  BTCs,  then  we  expect  BHC  performance  to  be  very 
good  independently  of  the  poor  performance  of  the  BTCs.  This  leads  to  the  definition  of  a  savant 
covering. 

Definition 18 (Savant Covering) For a C-class problem, consider the collection of N distinct
BTCs $\{T_k\}_{k=1}^{N}$. If for each set of corresponding nodes that are defined by this collection, there is
at least one BTC with node ambiguity less than or equal to $\epsilon$, then the BTC set is called a covering.
If in addition the N BTCs are each $\epsilon$-savant, then the covering is an $\epsilon$-savant covering. ■

Proposition  10  (BTC  and  BHC  Performance  for  Savant  Covering) 

Let H represent a BHC for a given C-class problem and let $\{T_k\}_{k=1}^{N}$ represent the constituent
BTCs for H. If the BTCs form a covering with parameter $\epsilon$ then the performance for H obeys the
following bound

\[ P_e(H) \le 1 - \left( 1 - f(\epsilon) \right)^{D}. \]

Moreover, if the covering is $\epsilon$-savant, then the performances for the constituent BTCs are bounded
by

\[ P_e(T_k) \ge \frac{C - 2}{C}, \quad \forall k. \]


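Proof. Since the constituent BTCs form a covering with parameter $\epsilon$, the minimum ambiguity for any
set of corresponding nodes is less than or equal to $\epsilon$. Therefore, performance for the BHC is given
by

\[ P_e(H) = \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D} \left( 1 - f(B_{n_{c,k}}) \right) \right], \]

where each $f(B_{n_{c,k}}) \le f(\epsilon)$. Thus, we can obtain the desired result in the following way

\begin{align*}
-f(B_{n_{c,k}}) &\ge -f(\epsilon) \\
1 - f(B_{n_{c,k}}) &\ge 1 - f(\epsilon) \\
\prod_{k=1}^{D} \left( 1 - f(B_{n_{c,k}}) \right) &\ge \prod_{k=1}^{D} \left( 1 - f(\epsilon) \right) = \left( 1 - f(\epsilon) \right)^{D} \\
\frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D} \left( 1 - f(B_{n_{c,k}}) \right) \right] &\le \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \left( 1 - f(\epsilon) \right)^{D} \right] \\
P_e(H) &\le 1 - \left( 1 - f(\epsilon) \right)^{D}.
\end{align*}

We have established the performance bound for the hypertree classifier for any $\epsilon$ covering. If the
BTCs form an $\epsilon$-savant covering, then

\[ P_e(T_k) = \frac{1}{C \, 2^{k_0(k)}} \left[ 2^{k_0(k)}(C - 1) - 1 + 2\epsilon \right], \]

where $k_0(k)$ is the level of the $\epsilon$-ambiguity node in tree $T_k$. The two performance extremes correspond
to $k_0(k) = 0$ and $D - 1$. Since the performance for $k_0(k) = 0$ is better than that for any
other $k_0(k)$, we have

\[ P_e(T_k) \ge \frac{1}{C} \left[ C - 2 + 2\epsilon \right], \]

thereby establishing the proposition. □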


Remark  8  Proposition  10  implies  that  a  high-performance  classifier  can  be  constructed  from  a 
set  of  classifiers  with  very  poor  performances  provided  that  these  classifiers  can  each  make  a  single 
low-error  decision. 
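To give a feel for the gap described in Remark 8, the following sketch evaluates both bounds of Proposition 10 for an eight-class problem. It assumes a hypothetical ambiguity-to-error map f(a) = a - a^2/2 (monotone, with f(0) = 0 and f(1) = 1/2); the report does not prescribe this form, and the function names are introduced here for illustration.

```python
def f(a):
    """Hypothetical ambiguity-to-error map (assumption): f(0) = 0, f(1) = 1/2."""
    return a - 0.5 * a * a


def bhc_covering_bound(C, eps):
    """Upper bound 1 - (1 - f(eps))^D on P_e(H) for an eps covering."""
    D = C.bit_length() - 1  # log2(C) for C a power of two
    return 1.0 - (1.0 - f(eps)) ** D


def savant_btc_floor(C):
    """Lower bound (C - 2)/C on P_e(T_k) for each eps-savant constituent BTC."""
    return (C - 2) / C


C, eps = 8, 0.01
print(f"BHC error is at most       {bhc_covering_bound(C, eps):.4f}")
print(f"each BTC error is at least {savant_btc_floor(C):.4f}")
```

Under these assumptions the hypertree errs on about 3% of inputs at most, while every constituent tree errs on at least 75%.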

For the case in which constituent BTCs are not savant, we have the following proposition.

Proposition 11 (Bounded BTC Node Ambiguities)

Let the BHC H have constituent BTCs $\{T_k\}_{k=1}^{N}$. If the maximum node ambiguity for each $T_k$ is
no larger than $\epsilon = f^{-1}[1 - (1 - \delta)^{1/D}]$, where $D = \log_2(C)$, then the performance for H obeys
$P_e(H) \le \delta$. ■

Proof. It follows easily from the premises that the performance for H is bounded by

\[ P_e(H) \le 1 - \left( 1 - f(\epsilon) \right)^{D}. \]

Setting $\delta$ equal to the right side of this inequality yields the solution for $\epsilon$,

\[ \epsilon = f^{-1}\left( 1 - (1 - \delta)^{1/D} \right). \]

□
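Proposition 11 can be used as a design rule: pick a target error δ and solve for the largest tolerable node ambiguity. The sketch below does this for a hypothetical map f(a) = a - a^2/2, whose inverse on [0, 1/2] is 1 - sqrt(1 - 2y). Both f and the helper names are assumptions made for illustration; any monotone f with a computable inverse could be substituted.

```python
import math


def f(a):
    """Hypothetical ambiguity-to-error map (assumption, not from the report)."""
    return a - 0.5 * a * a


def f_inv(y):
    """Inverse of the hypothetical f on the interval [0, 1/2]."""
    return 1.0 - math.sqrt(1.0 - 2.0 * y)


def max_node_ambiguity(delta, C):
    """Largest eps for which Proposition 11 guarantees P_e(H) <= delta."""
    D = C.bit_length() - 1  # log2(C) for C a power of two
    return f_inv(1.0 - (1.0 - delta) ** (1.0 / D))


C, delta = 8, 0.05
eps = max_node_ambiguity(delta, C)
bound = 1.0 - (1.0 - f(eps)) ** 3  # D = 3 for C = 8
print(f"eps = {eps:.5f}, resulting bound = {bound:.5f}")
```

Substituting the computed ε back into the bound recovers δ, which is a quick consistency check on the inversion.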


5  Analysis  for  Unbalanced  Trees 


What are the expected conceptual differences between the performances of balanced and unbalanced
tree classifiers? Because the unbalanced trees have more degrees of freedom in superclass
selection, one would think that the performance for the best unbalanced tree would lower bound
the performance for the best balanced tree.

In  this  section,  we  revisit  the  propositions  of  the  previous  section  on  balanced  trees  to  determine 
the  major  mathematical  differences  between  the  two  tree  types. 

As with balanced trees, we impose the constraint that the superclasses at each decision node be
mutually exclusive. Unlike in balanced trees, however, the sizes of the superclasses at a node do not
have to be equal. There is therefore still only a single path through the tree from root node to
terminal node for each class, but the length of each path is now a function of the class label c. For
example, consider the unbalanced tree in Figure 4. The path lengths are denoted by D(c) and range
from 2 to 5.

As before, we assume that the successive decisions made at the decision nodes are approximately
independent. The sequence of nodes visited as the tree is traversed along the unique path
from root node to terminal node for class c is, as before, $\{n_{c,k}\}_{k=1}^{D(c)}$.


Proposition  12  (Unbalanced  BTC  Error  Performance) 

The average error performance for an unbalanced BTC T that corresponds to a C-class problem is
given by

\[ P_e(T) = \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D(c)} \left( 1 - f(A_{n_{c,k}}) \right) \right]. \]


Proof.  The  proof  is  similar  to  that  for  Proposition  1.  □ 

Next  we  revisit  the  limiting  cases  of  the  basic  BTC  performance  formula. 


Figure 4: Illustration of the variable tree-traversal path length in an unbalanced tree classifier.

Proposition 13 (Unbalanced BTC Error Probability Limiting Cases)

Given an unbalanced BTC with problem size C, if all node ambiguities are zero, then the probability
of error for the BTC is zero. If all node ambiguities are equal to one, then the probability of
error is equal to (C - 1)/C. ■

Proof. TBD.

□
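One possible route to the two limiting cases, offered as a sketch rather than as the authors' proof, starts from the formula in Proposition 12. It assumes f(0) = 0 and f(1) = 1/2 (the latter value is the one assumed in the proof of Proposition 14), and it assumes that every decision node has exactly two children, so that the path lengths satisfy the Kraft equality $\sum_{c} 2^{-D(c)} = 1$:

```latex
\begin{align*}
A_{n_{c,k}} = 0 \;\; \forall c, k: \quad
P_e(T) &= \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D(c)} (1 - 0) \right] = 0, \\
A_{n_{c,k}} = 1 \;\; \forall c, k: \quad
P_e(T) &= \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - 2^{-D(c)} \right]
        = 1 - \frac{1}{C} \sum_{c=1}^{C} 2^{-D(c)}
        = \frac{C - 1}{C}.
\end{align*}
```

The second line shows why the unequal path lengths do not change the worst case: the shorter paths are exactly offset by the longer ones.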


Thus, the limiting cases for balanced or unbalanced BTCs are the same. We now turn to the
more interesting problem of determining BTC performance as a function of the position of a single
ambiguous node. Our general expression for the probability of error is given by

\[ P_e(T) = \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D(c)} \left( 1 - f(A_{n_{c,k}}) \right) \right]. \]

From  Figure  4,  we  see  that  the  tree  level  of  an  ambiguous  node  does  not  uniquely  determine  how 
many  class  paths  are  influenced  by  the  node.  Thus,  we  cannot  find  a  formula  for  performance  that 
depends  only  on  the  node’s  level.  However,  intuitively,  the  number  of  class  paths  is  still  important, 
which  leads  to  the  following  definition. 

Definition 19 (Node Impact) The impact of a decision node n in an arbitrary BTC is the number
of terminal nodes among n’s descendants. Thus the impact must be an integer between two and C.

Proposition  14  (Unbalanced  BTC  Error  as  a  Function  of  Ambiguous  Node  Position) 

For a C-class problem, let the unbalanced BTC T have a single ambiguous node n with ambiguity
$A_n = 1$ and impact I. Then the error performance for T is

\[ P_e(T) = \frac{I}{2C}. \]

Proof. The generic BTC performance formula is

\[ P_e(T) = \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D(c)} \left( 1 - f(A_{n_{c,k}}) \right) \right]. \]

Denote by $C_I$ the subset of class labels for which node n lies along the path leading to the label.
For all c not in $C_I$, the contribution to the error is zero since all nodes have zero ambiguity for
these paths. Thus, the performance is given by

\[ P_e(T) = \frac{1}{C} \sum_{c \in C_I} \left[ 1 - \prod_{k=1}^{D(c)} \left( 1 - f(A_{n_{c,k}}) \right) \right]. \]

Assuming that f(1) = 1/2 yields

\[ P_e(T) = \frac{1}{C} \sum_{c \in C_I} \left[ 1 - \left( 1 - \tfrac{1}{2} \right) \right] = \frac{I}{2C}. \]

□

Remark 9 Note that the performance for an unbalanced BTC reduces to that for the balanced
case, in which the impact must be of the form $I = C/2^{k_0}$. Thus $P_e = I/(2C) = (C/2^{k_0})/(2C) = 2^{-(k_0 + 1)}$.
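Node impact is easy to compute from a tree description, and the result of Proposition 14 can be confirmed by summing over class paths directly. The sketch below represents a tree as nested (left, right) tuples with integer class labels at the terminal nodes. This representation, the example tree (chosen so that path lengths range from 2 to 5, and not claimed to be the tree of Figure 4), and the values f(0) = 0, f(1) = 1/2 are all assumptions for illustration.

```python
def leaves(tree):
    """Class labels under `tree`, which is an int (terminal) or a (left, right) pair."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def impact(node):
    """Number of terminal nodes among the descendants of a decision node."""
    return len(leaves(node))


def error_direct(tree, ambiguous, C, p_correct=1.0):
    """Path-sum evaluation of P_e(T) when only `ambiguous` has ambiguity 1.
    Uses f(1) = 1/2 at the ambiguous node and f(0) = 0 at all other nodes."""
    if isinstance(tree, int):
        return (1.0 - p_correct) / C
    if tree == ambiguous:          # labels are unique, so equality identifies the node
        p_correct *= 0.5
    left, right = tree
    return (error_direct(left, ambiguous, C, p_correct)
            + error_direct(right, ambiguous, C, p_correct))


tree = ((1, 2), ((3, 4), (5, (6, (7, 8)))))   # C = 8, path lengths 2 to 5
node = (6, (7, 8))                            # the single ambiguous node
C = len(leaves(tree))
print("impact I =", impact(node))
print("P_e from path sum =", error_direct(tree, node, C))
print("I / (2C) =", impact(node) / (2 * C))
```

Moving the ambiguous node toward the root increases its impact and hence the error, independently of how deep the node happens to sit in the unbalanced tree.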
Proposition 15 (Error Performance for an Unbalanced BHC)

The average error probability for an unbalanced BHC is given by

\[ P_e(\mathrm{BHC}) = \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D(c)} \left( 1 - f(B_{n_{c,k}}) \right) \right]. \]

Proof.  Similar  to  that  for  Proposition  4.  □ 

Proposition  16  (Reducible  BSC  and  Simple  BHC  for  Unbalanced  Trees) 

Let S represent a reducible unbalanced BSC associated with the C-class problem Pi and the multimodal
CID set $Y = \{X_j\}_{j=1}^{M}$. Then there exists a simple unbalanced BHC H for Pi associated
with Y such that $P_e(S) = P_e(H)$. Moreover, there are M BTCs in H. ■

Proof.  Similar  to  that  for  Proposition  5.  □ 

Proposition  17  (Irreducible  BSC  and  Complex  BHC  for  Unbalanced  Trees) 

Let S represent an irreducible unbalanced BSC associated with the multimodal CID set
$Y = \{X_j\}_{j=1}^{M}$ and a C-class problem Pi. Then there exists a complex unbalanced BHC H associated
with Y and Pi such that $P_e(S) = P_e(H)$. ■

Proof.  Similar  to  that  for  Proposition  6.  □ 

Proposition  18  (Incremental  Unbalanced  BHC  Error  Performance) 

Let $H_1$ and $H_2$ be unbalanced BHCs for a given C-class problem and multimodal CID
$Y = \{X_j\}_{j=1}^{M}$. Let $H_1$ and $H_2$ differ only in a single decision node n for which the minimum ambiguity
over all corresponding nodes with index n is larger in $H_2$ than in $H_1$. Then the performance
difference for the two BHCs is given by

\[ P_e(H_2) - P_e(H_1) \approx \frac{I}{C} \left( f(B'_n) - f(B_n) \right), \]

where $B'_n$ and $B_n$ are the minimum ambiguities for $H_2$ and $H_1$, respectively, at node n, and I
denotes the impact of node n. ■

Proof.  Similar  to  that  for  Proposition  7,  except  the  node  impact  determines  the  underlying  perfor¬ 
mance  of  the  BTCs  and  BHC.  □ 

Proposition  19  (Approximation  of  an  Unbalanced  BHC) 

Let the unbalanced BHCs $H_1$ and $H_2$ correspond to the same C-class problem and multimodal CID
$Y = \{X_j\}_{j=1}^{M}$. $H_2$ approximates $H_1$ by employing lower-dimensional CIDs at one or more nodes,
or by using a smaller value of K at one or more nodes. Assume that the minimum decision-node
ambiguities for $H_1$ lower bound those for $H_2$: $B'_{n_{c,k}} \ge B_{n_{c,k}}$. Then

\[ P_e(H_2) - P_e(H_1) \approx \frac{1}{C} \sum_{c=1}^{C} \sum_{l=1}^{D(c)} \left( f(B'_{n_{c,l}}) - f(B_{n_{c,l}}) \right) \prod_{k=1, k \ne l}^{D(c)} \left( 1 - f(B_{n_{c,k}}) \right). \]

Proof. Similar to that for Proposition 8, except that, for each class c, the inner sum and product
have upper limits set by D(c).

□

Proposition  20  (Performance  for  an  Unbalanced  Savant  BTC) 

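Suppose we have an unbalanced savant BTC T with parameter $\epsilon$ and the unambiguous node has
impact I. Let the set of class labels which have tree-traversal paths that do not contain the unambiguous
node be denoted by $U_1$ and the rest by $U_2$. Then the performance for T is given by

\[ P_e(T) \approx \frac{1}{C} \left[ C - \sum_{c \in U_1} 2^{-D(c)} - 2(1 - \epsilon) \sum_{c \in U_2} 2^{-D(c)} \right], \]

provided that $f(\epsilon) \approx \epsilon$ and $f(1 - \epsilon) \approx 1/2$.

Proof. For any BTC, the performance formula is given by

\[ P_e(T) = \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D(c)} \left( 1 - f(A_{n_{c,k}}) \right) \right]. \]

We can divide the sum over c into two components, one which contains a contribution from the
unambiguous node, and one which does not,

\[ P_e(T) \approx \frac{1}{C} \left[ \sum_{c \in U_1} \left( 1 - \prod_{k=1}^{D(c)} \left( 1 - f(1 - \epsilon) \right) \right) + \sum_{c \in U_2} \left( 1 - \left( 1 - f(\epsilon) \right) \prod_{k \ne k_c} \left( 1 - f(1 - \epsilon) \right) \right) \right], \]

where $k_c$ denotes the index of the unambiguous node in the path $\{n_{c,k}\}_{k=1}^{D(c)}$ for class c. Note that
$|U_2| = I$ and $|U_1| = C - I$. Straightforward algebra leads to the desired result. □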


Proposition  21  (Unbalanced  BTC  and  BHC  Performance  for  Savant  Covering) 

Let H represent an unbalanced BHC for a given C-class problem and let $\{T_k\}_{k=1}^{N}$ represent the
constituent BTCs for H. If the BTCs form a covering with parameter $\epsilon$ then the performance for
H obeys the following bound

\[ P_e(H) \le 1 - \left( 1 - f(\epsilon) \right)^{D_*}, \]

where $D_* = \max_c D(c)$. ■

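Proof. By the definition of a cover, we have the fundamental inequality for all decision nodes in
the BHC $B_{n_{c,k}} \le \epsilon$ which, by the monotonicity of $f(\cdot)$, yields $f(B_{n_{c,k}}) \le f(\epsilon)$. We obtain the
following sequence of inequalities

\begin{align*}
-f(B_{n_{c,k}}) &\ge -f(\epsilon) \\
1 - f(B_{n_{c,k}}) &\ge 1 - f(\epsilon) \\
\prod_{k=1}^{D(c)} \left( 1 - f(B_{n_{c,k}}) \right) &\ge \prod_{k=1}^{D(c)} \left( 1 - f(\epsilon) \right) = \left( 1 - f(\epsilon) \right)^{D(c)} \\
-\prod_{k=1}^{D(c)} \left( 1 - f(B_{n_{c,k}}) \right) &\le -\left( 1 - f(\epsilon) \right)^{D(c)} \\
1 - \prod_{k=1}^{D(c)} \left( 1 - f(B_{n_{c,k}}) \right) &\le 1 - \left( 1 - f(\epsilon) \right)^{D(c)} \\
\frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \prod_{k=1}^{D(c)} \left( 1 - f(B_{n_{c,k}}) \right) \right] &\le \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \left( 1 - f(\epsilon) \right)^{D(c)} \right] \\
P_e(H) &\le \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \left( 1 - f(\epsilon) \right)^{D(c)} \right].
\end{align*}

Since $f(\epsilon)$ must lie between zero and one, we have

\[ \left( 1 - f(\epsilon) \right)^{D(c)} \ge \left( 1 - f(\epsilon) \right)^{D_*} \]

and therefore that

\[ P_e(H) \le \frac{1}{C} \sum_{c=1}^{C} \left[ 1 - \left( 1 - f(\epsilon) \right)^{D_*} \right] = 1 - \left( 1 - f(\epsilon) \right)^{D_*}. \]

□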


Proposition  22  (Bounded  Unbalanced  BTC  Node  Ambiguities) 

Let the unbalanced BHC H have constituent BTCs $\{T_k\}_{k=1}^{N}$. If the maximum node ambiguity for
each $T_k$ is no larger than $\epsilon = f^{-1}[1 - (1 - \delta)^{1/D_*}]$, where $D_* = \max_c D(c)$, then the performance
for H obeys $P_e(H) \le \delta$. ■

Proof.  The  proof  is  similar  to  that  for  Proposition  11.  □ 
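The bound in Proposition 21 replaces every path length by the longest one, so it can be loose when most paths are short. The sketch below compares the per-class bound from the proof with the worst-case bound for an illustrative set of path lengths (2 to 5, for eight classes). The hypothetical map f(a) = a - a^2/2 and all names are assumptions introduced for this example.

```python
def f(a):
    """Hypothetical ambiguity-to-error map (assumption, not from the report)."""
    return a - 0.5 * a * a


def unbalanced_covering_bounds(path_lengths, eps):
    """Return (per-class bound, worst-case bound) on P_e(H) for an eps covering.
    path_lengths[c] plays the role of D(c); the second value uses D_* = max D(c)."""
    C = len(path_lengths)
    d_star = max(path_lengths)
    per_class = sum(1.0 - (1.0 - f(eps)) ** d for d in path_lengths) / C
    worst_case = 1.0 - (1.0 - f(eps)) ** d_star
    return per_class, worst_case


path_lengths = [2, 2, 3, 3, 3, 4, 5, 5]   # illustrative D(c) for C = 8
per_class, worst_case = unbalanced_covering_bounds(path_lengths, eps=0.02)
print(f"per-class bound  = {per_class:.4f}")
print(f"worst-case bound = {worst_case:.4f}")
```

When the intermediate per-class bound is available it is the tighter of the two; the worst-case form is the one that inverts cleanly to give the ambiguity budget of Proposition 22.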

6  Algorithms 

In this section, we provide high-level algorithm statements for construction and use of the
tree-based classifiers. Construction consists of exploiting the LDB to automatically find the most
discriminating statistics in the training set. Classifier use simply means using the classifier to classify
a CID. We begin with some notation so that the algorithm statements are sufficiently concise.

6.1  Notation 

We adopt a programming-style notation consisting of variables and their records. Let btc denote
a single binary tree classifier (BTC) and bhc denote a BHC. These trees are associated with the
following records.

btc.cidType  A string or numeric value indicating the type of input used to train the BTC. For example,
“range-doppler chips” or “optical image.”

btc.angle  The viewing angle for the targets (equivalent to pose).

btc.C  The size of the classification problem addressed by btc.

btc.a  The downbranch ambiguity for node 1 (the root-node ADA; this is the total tree ambiguity).

btc.index  The unique index of the BTC (used when the BTC is a constituent tree for a BHC).

btc.n  This is an array of node records, defined next.

n.index  The index of the BTC node using standard top-to-bottom left-to-right numbering starting
with index 1 (see Figure 1).

n.inSuperClass  An  array  of  class  labels  (integers)  that  are  associated  with  the  decision  to  enter 
the  node. 

n.leftSuperClass  An  array  of  class  labels  (integers)  that  correspond  to  taking  the  left  path  out  of 
the  node. 

n.rightSuperClass  An  array  of  class  labels  (integers)  that  correspond  to  taking  the  right  path  out 
of  the  node.  The  union  of  leftSuperClass  and  rightSuperClass  is  inSuperClass  for  each 
node. 

n.A  The  ambiguity  for  node  n. 

n.u  The right-path K-component feature vector.

n.v  The left-path K-component feature vector.

n.wavelet  String containing the wavelet class (mother wavelet type, e.g., Coiflet).

n.w  Specifies the K wavelet transform locations from which to obtain the desired K-component
feature vector for comparison with n.u and n.v.

n.minTree  Index  of  the  BTC  (tree)  with  minimum  ambiguity  corresponding  node. 

n.ada  The  average  downbranch  ambiguity  for  the  node. 

n.leftNode  Index  of  the  node  corresponding  to  the  left  path  out  of  n. 

n.rightNode  Index  of  the  node  corresponding  to  the  right  path  out  of  n. 

n.parent  Index of parent node for n (set to 0 for the root node).

bhc.numBTC  The  number  of  constituent  BTCs  in  the  BHC. 

bhc.btc  An  array  of  numBTC  BTCs. 

bhc.C  The  number  of  classes  in  the  hypertree  (classification  problem  size). 
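The records above map naturally onto simple data classes. The following is one possible rendering, assuming Python conventions (snake_case field names in place of the report's camelCase, lists in place of arrays, and default values chosen here for convenience). None of these choices come from the report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    index: int                          # n.index (numbering starts at 1)
    in_super_class: List[int]           # n.inSuperClass
    left_super_class: List[int]         # n.leftSuperClass
    right_super_class: List[int]        # n.rightSuperClass
    A: float = 1.0                      # n.A, node ambiguity
    u: List[float] = field(default_factory=list)   # right-path feature vector
    v: List[float] = field(default_factory=list)   # left-path feature vector
    wavelet: str = ""                   # n.wavelet
    w: List[int] = field(default_factory=list)     # K transform locations
    min_tree: int = 0                   # n.minTree
    ada: float = 1.0                    # n.ada
    left_node: int = 0                  # n.leftNode
    right_node: int = 0                 # n.rightNode
    parent: int = 0                     # n.parent (0 for the root node)


@dataclass
class BTC:
    cid_type: str                       # btc.cidType
    angle: float                        # btc.angle
    C: int                              # btc.C
    index: int                          # btc.index
    a: float = 1.0                      # btc.a, total tree ambiguity
    n: List[Node] = field(default_factory=list)    # btc.n


@dataclass
class BHC:
    C: int                              # bhc.C
    btc: List[BTC] = field(default_factory=list)   # bhc.btc

    @property
    def num_btc(self) -> int:           # bhc.numBTC, derived from the array
        return len(self.btc)


root = Node(index=1, in_super_class=[1, 2, 3, 4],
            left_super_class=[1, 2], right_super_class=[3, 4])
tree = BTC(cid_type="optical image", angle=0.0, C=4, index=1, n=[root])
hyper = BHC(C=4, btc=[tree])
print(hyper.num_btc)
```

Deriving numBTC from the array rather than storing it separately is a design choice of this sketch; it avoids the two values drifting apart.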
1. Obtain the training data sets for the C-class problem of interest
and specify the allowable set of wavelets.

2. Specify the input data type, generating btc.cidType.

3. Construct a binary tree with log2(C) + 1 levels.

4. For each decision node, select left and right superclasses.
Nominally, these are of equal size and consist simply of the
leftmost elements for the left superclass and the rightmost
elements for the right superclass.

5. Find the LDB for each wavelet type and decision node in the
tree.

6. Select the most discriminating K elements of the most
discriminating LDB for each node, generating btc.n.wavelet and btc.n.w
for each decision node.

7. Find the average value of the K LDB elements for each
superclass in each decision node, generating btc.n.u and btc.n.v for
each decision node.

8. For each decision node n, compute the ambiguity, generating
btc.n.A.

9. For each decision node n, compute the average downbranch
ambiguity btc.n.ada.

Figure  5:  The  algorithm  for  constructing  a  BTC. 

6.2  Classifier Construction Algorithms

In this section we present the basic construction algorithms. Because our fundamental
feature-finding tool is an adapted LDB algorithm (see Appendix B), the classifiers must be constructed
with the use of sufficient training data.

Figure  5  provides  a  simple  statement  of  the  algorithm  for  creating  a  basic  BTC,  and  Figure  6 
provides  the  algorithm  for  the  BHC. 

6.3  Classifier Tree Traversal Algorithms

In this section, we provide algorithm statements for traversing the BTC and BHC trees. That is, we
discuss how to use the constructed trees to classify an input. For the BTC, the traversal is
straightforward. For the BHC, there are variants that could exhibit substantially different performances.




1. Specify the set of available CIDs. This generates bhc.numBTC.

2. Obtain the training data sets for the C-class problem of interest
and specify the allowable set of wavelets. This generates
bhc.C.

3. For each distinct CID type, construct a BTC as in Figure 5. This
generates the array bhc.btc.

4. Set n = 1.

5. For decision node index n, find the BTC indices for all
corresponding nodes in the array of BTCs bhc.btc. Call this set of
BTC indices Tn.

6. Find the BTC in Tn with minimum btc.n.A. Denote the index of
this constituent BTC m.

7. Set bhc.btc.n.minTree equal to m for each BTC in the set Tn.

8. Increment n by 1. If n is less than 2C, goto Step 5.


Figure  6:  The  algorithm  for  constructing  a  BHC. 
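Steps 4 through 8 of Figure 6 amount to a per-node minimum over the constituent trees. A minimal sketch follows, assuming the node ambiguities have already been computed and are held in a nested dictionary `ambiguity[t][n]` (tree index, then node index). That layout and the function name are illustrative choices, not the report's data structures.

```python
def assign_min_trees(ambiguity):
    """For each node index n, return the index of the constituent BTC whose
    corresponding node has minimum ambiguity.

    ambiguity[t][n] is the ambiguity of node n in BTC t. A tree that has no
    node n is simply absent from that node's set of corresponding nodes.
    Ties go to the tree that appears first in `ambiguity`.
    """
    node_indices = set().union(*(tree.keys() for tree in ambiguity.values()))
    min_tree = {}
    for n in sorted(node_indices):
        candidates = {t: tree[n] for t, tree in ambiguity.items() if n in tree}
        min_tree[n] = min(candidates, key=candidates.get)
    return min_tree


# Two constituent BTCs for a four-class problem (decision nodes 1, 2, 3).
ambiguity = {
    1: {1: 0.0, 2: 1.0, 3: 0.2},
    2: {1: 0.9, 2: 0.1, 3: 0.2},
}
print(assign_min_trees(ambiguity))
```

Because Step 7 writes the same index m into every tree in the set, a single table keyed by node index carries the same information as the per-tree minTree fields.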

6.3.1  Binary  Tree  Classifiers 

The  algorithm  for  traversing  a  BTC  is  shown  in  Figure  7.  The  basic  idea  is  to  compute  the  feature 
vector  at  a  decision  node  and  compare  it  to  the  left-  and  right-path  stored  average  feature  vectors 
for  that  node.  If  the  measured  feature  vector  more  closely  resembles  the  left  (right)  feature  vector, 
then  the  left  (right)  path  out  of  the  node  is  taken.  If  the  measured  vector  is  equally  well  correlated 
with  both  the  left  and  right  vectors,  then  a  fair  two-sided  coin  is  flipped. 
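The traversal just described can be sketched in a few lines. This sketch assumes a balanced tree with heap-style numbering (children of node n are 2n and 2n + 1), assumes the transform coefficients of the CID have already been computed and are passed in as a flat list `coeffs`, and stores each node as a dictionary with keys `w`, `u`, and `v`. These representational choices and the tiny four-class example are assumptions for illustration. Ties are broken with a fair coin, as in the text.

```python
import math
import random


def corr_coef(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def traverse_btc(nodes, coeffs, C, rng=random):
    """Classify one input; returns a class label in 1..C."""
    D = C.bit_length() - 1                       # log2(C) for C a power of two
    n = 1
    for _ in range(D):                           # one decision per tree level
        node = nodes[n]
        c = [coeffs[i] for i in node["w"]]       # measured K-component feature vector
        rho_left = corr_coef(c, node["v"])       # v is the left-path average
        rho_right = corr_coef(c, node["u"])      # u is the right-path average
        if rho_left == rho_right:
            go_left = rng.random() < 0.5         # fair coin on ties
        else:
            go_left = rho_left > rho_right
        n = 2 * n if go_left else 2 * n + 1
    return n - (2 ** D - 1)


nodes = {
    1: {"w": [0, 1, 2], "v": [1, 0, -1], "u": [-1, 0, 1]},
    2: {"w": [1, 2, 3], "v": [0, 0, 1], "u": [1, 0, 0]},
    3: {"w": [1, 2, 3], "v": [0, 1, 0], "u": [1, 0, 0]},
}
coeffs = [2.0, 1.0, 0.0, 5.0]
print(traverse_btc(nodes, coeffs, C=4))
```

The loop makes exactly D decisions, so the final node index is always a terminal node and the subtraction of 2^D - 1 converts it to a class label.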

6.3.2  Binary  Hypertree  Classifiers 

The  basic  algorithm  for  traversing  a  BHC  is  provided  in  Figure  8. 
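The basic hypertree traversal can be sketched as follows, under several stated assumptions: the tree is balanced with heap-style numbering; the per-node minimum-ambiguity tree indices are held in a single table `min_tree`; the node decision itself is delegated to a caller-supplied function `decide(j, n, cid)` that returns True for the left path; and a missing CID is fetched through a caller-supplied `request_cid(j)`. All of these names are hypothetical. Unlike the figure, this sketch does not look up a minimum-ambiguity tree for terminal nodes.

```python
def traverse_bhc(min_tree, decide, cids, request_cid, C):
    """Classify one input with a hypertree.

    min_tree[n]    index of the minimum-ambiguity BTC for decision node n
    decide(j,n,x)  True to take the left path out of node n of BTC j on CID x
    cids           dict mapping BTC index -> CID already in hand
    request_cid(j) fetches the CID type used by BTC j from the sensor suite
    Returns (class label, list of BTC indices whose CIDs had to be requested).
    """
    D = C.bit_length() - 1                 # log2(C) for C a power of two
    first_terminal = 2 ** D
    n = 1
    j = min_tree[1]                        # start in the best tree for the root
    requested = []
    while n < first_terminal:              # decision nodes are 1 .. 2^D - 1
        if j not in cids:
            cids[j] = request_cid(j)       # ask for data only when needed
            requested.append(j)
        n = 2 * n if decide(j, n, cids[j]) else 2 * n + 1
        if n < first_terminal:
            j = min_tree[n]                # switch to the best tree for node n
    return n - (first_terminal - 1), requested


# Four-class example: tree 1 is best at nodes 1 and 3, tree 2 at node 2.
min_tree = {1: 1, 2: 2, 3: 1}
cids = {1: "black-and-white image"}
label, requested = traverse_bhc(
    min_tree,
    decide=lambda j, n, cid: True,         # stand-in decision rule: always left
    cids=cids,
    request_cid=lambda j: f"CID for tree {j}",
    C=4,
)
print(label, requested)
```

The list of requested trees makes the data cost of a classification explicit, which is the quantity the hypertree is meant to keep small.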


7  Illustrative  Example 

Here we extend our basic eight-class toy problem originally defined in the DARPA TRUMPETS
work [2]. The basic idea of the problem is, as before, a sort of image classification, but now we
have multiple image modes available. So the classification problem will be to identify the label
of the object that gives rise to the available image-oriented CIDs. Let’s suppose we have a crude
imager that produces binary-valued pixels (black-and-white camera), a slightly more sophisticated
imager that produces gray-scale images (gray-scale camera), one that produces color images (color
camera), and one that produces infrared images (infrared camera), which images temperature
differences in the underlying physical objects.

1. Obtain the data to be classified (a CID X).

2. Compute the classification tree depth D = log2(C).

3. Set n = 1 and j = 1.

4. Set c_m to the wavelet coefficient of X at location n.w(m) for
m = 1, ..., K, and form c = [c_1, ..., c_K].

5. Obtain the children of node n: [n_l, n_r] = children(n).

6. Set ρ_l = CorrCoef(c, v_n).

7. Set ρ_r = CorrCoef(c, u_n).

8. If ρ_l > ρ_r then n = n_l else n = n_r.

9. j = j + 1.

10. If j ≤ D goto Step 4.

11. Class decision is n - (2^D - 1).


Figure 7: The algorithm for traversing the BTC.


1. Obtain at least one CID X to be classified.

2. Choose the constituent BTC with minimum node-1 ambiguity,
say the BTC with index j. Node 1 is the current node.

3. If no CID is present for the current BTC, request the CID from
the sensor suite.

4. Use the basic BTC algorithm to select the left or right path out
of current node in current BTC j, landing on node k. Update
current node to k.

5. Switch to BTC bhc.btc[j].n[k].minTree, the minimum-ambiguity
BTC for node k. Update current BTC to bhc.btc[j].n[k].minTree.

6. If node k is a decision node, goto Step 3.

7. Class decision is k - (2^D - 1).


Figure 8: The basic algorithm for traversing the BHC.


1. Obtain at least one CID X to be classified.

2. Choose the constituent BTC with minimum node-1 ambiguity,
say the BTC with index j. Node 1 is the current node. Set
j0 = j.

3. If no CID is present for the current BTC, request the
K-component feature vector for the CID and current node from
the sensor suite.

4. Use the basic BTC algorithm to select the left or right path out
of current node in current BTC j, landing on node k. Update
current node to k. Update current BTC j to j0.

5. Switch to BTC bhc.btc[j].n[k].minTree, the minimum-ambiguity
BTC for node k. Update current BTC to bhc.btc[j].n[k].minTree.

6. If node k is a decision node, goto Step 3.

7. Class decision is k - (2^D - 1).


Figure 9: The switch-and-return algorithm for traversing the BHC.

We  define  the  classes  in  terms  of  the  idealized  images  they  produce  at  the  outputs  of  the  four 
cameras,  as  shown  in  Figure  10.  The  interesting  part  of  this  problem  is  to  create  a  situation  in 
which  classification  is  ambiguous  (flawed;  irreducible  nonzero  probability  of  error)  for  each  imager 
independently,  but  not  when  used  together  in  a  BHC  (as  needed  during  classification)  or  in  a  BSC. 

Discussion. 

A  casual  examination  of  the  idealized  images  in  Figure  10  reveals  that  there  are  several  completely 
ambiguous  classes  for  each  camera  type.  For  example,  for  the  black-and-white  camera,  there  are 
two  sets  of  ambiguous  classes:  the  circles  and  the  crosses.  Out  of  eight  classes,  only  two  may 
be  reliably  classified  correctly.  It  is  very  important  to  realize  that  these  problems  are  inherently 
difficult  for  any  classifier  structure,  not  just  tree-based  structures. 

In  addition  to  the  large  number  of  ambiguous  classes  for  each  camera,  note  that  no  two  cameras 
possess  the  same  set  of  ambiguous  classes.  That  is,  for  any  two  classes,  there  is  at  least  one  camera 
type  for  which  the  corresponding  images  are  distinct.  This  fact  is  crucial  to  the  success  of  a  BHC 
(or  a  BSC):  there  must  be  sufficient  discriminatory  power  in  the  collection  of  CID  types  to  allow 
good  classification  performance.  If  there  is  not,  then  the  collection  must  be  modified  in  some 
fashion,  such  as  adding  new  CID  types  (sensor  modalities  or  types). 

Sample  Hypertrees. 

The best BTC for each camera type is shown in Figure 11. For this problem, balanced binary
trees are not optimal for the gray-scale, color, and infrared camera types. For each of the four best
binary trees, additional trees are constructed using the obtained superclasses for each of the other
three camera types. This ensures that each node in each of the four best trees has at least three
corresponding nodes. The four sets of four BTCs are shown in Figures 12-15 for the
black-and-white, gray-scale, color, and infrared cameras, respectively.

Note  that  no  single  constituent  BTC  is  free  from  highly  ambiguous  nodes.  However,  through 
the  use  of  the  hypertree,  we  can  find  an  unambiguous  path  through  the  collection  of  BTCs  for  each 
input  class,  guaranteeing  good  performance.  The  hypertree  indices  are  not  shown  in  the  figures  to 
keep  the  figures  legible.  Let  us  illustrate  the  BHC  operation  with  a  few  examples  next. 

Operation. 

First  suppose  that  the  true  class  label  is  1,  and  we  are  provided  initially  with  an  output  from  the 
black-and-white  (B&W)  camera.  So  we  know  that  we  start  in  BTC  1  (see  Figure  12).  Node  1 
of  tree  1  points  to  itself,  so  the  processing  is  performed  on  the  B&W  camera  output,  resulting  in 
taking  the  left  branch  out  of  node  1  and  landing  on  node  2  in  tree  1.  This  node  is  ambiguous  and 
points  to  tree  3,  node  2.  To  continue,  the  ISP  system  must  request  a  color  camera  output.  It  then 
processes  this  new  sensor  output  and  takes  the  left  branch  out  of  node  2,  landing  on  node  4,  which 
is  a  terminal  node  and  is  correct.  The  traversal  of  the  hypertree  can  be  summarized  in  tabular  form 
in  the  following  way: 


(Tree, Node)    Camera        Get New Data?    Branch To/Jump To
(1, 1)          B&W           No               (1, 2)
(1, 2)          B&W           No               (3, 2)
(3, 2)          Color         Yes              (3, 4)

Next consider that the true class label is 2 and we again start out with a B&W camera output.
The following hypertree traversal results.

(Tree, Node)    Camera        Get New Data?    Branch To/Jump To
(1, 1)          B&W           No               (1, 2)
(1, 2)          B&W           No               (3, 2)
(3, 2)          Color         Yes              (3, 5)
(3, 5)          Color         No               (12, 5)
(12, 5)         Gray-Scale    Yes              (12, 9)
(12, 9)         Gray-Scale    No               (13, 9)
(13, 9)         IR            Yes              (13, 14)

We  see  that  to  properly  classify  class  2  starting  with  a  B&W  image,  we  require  all  three  additional 
camera  outputs.  The  situation  is  quite  different  for  classes  3  and  4.  For  example,  for  class  3  and  an 
initial  image  from  the  B&W  camera,  we  have  the  following  hypertree  traversal: 

(Tree, Node)    Camera        Get New Data?    Branch To/Jump To
(1, 1)          B&W           No               (1, 3)
(1, 3)          B&W           No               (1, 6)
(1, 6)          B&W           No               (1, 12)

For  this  example,  no  additional  camera  outputs  are  required  for  good  classification.  A  similar  result 
holds  for  class  4. 

Notice that if we have a class-3 input and we begin with an IR-camera output, we do, in fact,
require additional camera outputs to correctly classify the input:

(Tree, Node)    Camera        Get New Data?    Branch To/Jump To
(4, 1)          IR            No               (4, 2)
(4, 2)          IR            No               (4, 4)
(4, 4)          IR            No               (16, 4)
(16, 4)         Gray-Scale    Yes              (16, 8)
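
The traversals tabulated in this section can be reproduced with a small table-driven sketch. The fragment below is illustrative only: it encodes just the links stated in the text for the class-1 example, and the names `CAMERA`, `POINTER`, `CHILD`, `TERMINAL`, and `traverse` are hypothetical rather than part of the report's software. Branch decisions are supplied by a stand-in function instead of actual feature processing.

```python
# Minimal sketch of hypertree traversal for the class-1 example.
# Only the links described in the text are encoded; all names are hypothetical.

CAMERA = {1: "B&W", 3: "Color"}                  # tree index -> camera type
POINTER = {(1, 1): (1, 1), (1, 2): (3, 2)}       # (tree, node) -> best (tree, node)
CHILD = {((1, 1), "left"): (1, 2),               # (decision node, branch) -> child
         ((3, 2), "left"): (3, 4)}
TERMINAL = {(3, 4)}


def traverse(start, decide, have):
    """Follow pointers and branches from `start`; return the table rows.

    decide(node) returns "left" or "right" (a stand-in for feature processing);
    `have` is the set of camera types whose outputs are already available.
    """
    rows, node = [], start
    while node not in TERMINAL:
        target = POINTER.get(node, node)         # unlisted nodes point to themselves
        if target != node:                       # ambiguous node: jump to another tree
            rows.append((node, CAMERA[node[0]], "No", target))
            node = target
            continue
        cam = CAMERA[node[0]]
        new_data = cam not in have               # request a sensor output if needed
        have.add(cam)
        nxt = CHILD[(node, decide(node))]
        rows.append((node, cam, "Yes" if new_data else "No", nxt))
        node = nxt
    return rows


rows = traverse((1, 1), decide=lambda node: "left", have={"B&W"})
for row in rows:
    print(row)
```

Each printed tuple corresponds to one row of the class-1 table: current (tree, node), camera, whether new data was requested, and the node branched or jumped to.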

8  Extensions 

The  tree-based  classification  methods  outlined  in  this  report  are  quite  general  in  two  important 
respects. First, they can accommodate widely differing input (sensor output) data types. For example,
the available sensor outputs can consist of two-dimensional optical images, two-dimensional
range-Doppler radar returns, one-dimensional high-range-resolution (HRR) returns, SAR images,
infrared  images,  sound  records,  etc.  In  other  words,  the  hypertree  classification  architecture  is 
naturally  suited  for  problems  in  which  data  fusion  is  essential. 

The  second  way  in  which  the  classifiers  exhibit  great  generality  and  flexibility  is  in  the  specific 
choice  of  classifier  type.  We  need  not  restrict  ourselves  to  tree-based  classifiers  in  order  to  reap  the 
benefits  of  the  ISP-enabled  hypertree  system.  All  that  is  required  is  the  ability  to  detect  ambiguous 
decisions  and  a  way  to  link  ambiguous  decisions  to  the  collection  of  the  best  new  data  set  for 
resolving  the  detected  ambiguity. 


9  Conclusions 

We  have  documented  our  initial  research  efforts  in  the  area  of  binary  hypertree  classifiers  (BHCs). 
The  core  notion  is  the  generalization  of  simple  binary-tree  classifiers  (BTCs)  to  ISP  systems  in 
which traversing the tree is linked to sensor controls so that as ambiguous situations are encountered,
further sensor data is requested from the sensor with the most discriminatory power for the
situation. In this way, processing (feature-based classification) is strongly integrated with sensing.

A  mathematical  framework  for  hypertree  classifiers  is  laid  out,  some  performance  analysis 
results  are  obtained,  and  algorithms  for  tree  construction  and  traversal  are  presented.  Future  work 
will focus on creating a simulator and extending the mathematical analysis.

Figure 10: Idealized images for the eight-class illustrative example.

Figure 11: The best binary-tree classifiers for each of the four camera types. The ambiguous nodes
are highlighted in red.

Figure 12: The binary-tree classifiers for the tree specified by the best BTC for the black-and-white
camera. Ambiguous nodes are highlighted in red.

Figure 13: The binary-tree classifiers for the tree specified by the best BTC for the gray-scale
camera. Ambiguous nodes are highlighted in red.

Figure 14: The binary-tree classifiers for the tree specified by the best BTC for the color
camera. Ambiguous nodes are highlighted in red.

Figure 15: The binary-tree classifiers for the tree specified by the best BTC for the infrared
camera. Ambiguous nodes are highlighted in red.

Appendices

A  Wavelets  and  Wavelet  Packets 

Wavelet  Transforms. 

The wavelet transform of an $N \times N$ image is defined by a pair of quadrature mirror filters (QMFs)
$h(\cdot)$ and $g(\cdot)$ and a maximum decomposition depth $J$ [5]. The filters $h$ and $g$ are low- and high-pass
filters, respectively. The transform iteratively applies the filters to the rows and columns of
the image, subsamples the results, and then starts over with the subsampled data. The filters are
applied in all four of their row-column combinations: low-low (LL), low-high (LH), high-low
(HL), and high-high (HH). At each iteration, the convolution-sampling operation is applied to the
LL data only, while the other three data sets are retained as is. At the final stage (stage $J$), the LL
coefficients are also retained. For example, Figure 16 shows an image and its decomposition for
$J = 1$.
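
As a rough illustration of a single decomposition stage, the sketch below filters and subsamples the rows and then the columns of a small image to produce the four subbands. It assumes the Haar QMF pair, chosen here only because it is the shortest possible filter pair (Figure 16 uses a Daubechies wavelet instead), and the function names are hypothetical.

```python
# One decomposition stage (J = 1) of a separable 2-D wavelet transform.
# Assumption: Haar QMF pair, used only because it is the shortest filter pair.
import math

S = 1.0 / math.sqrt(2.0)


def analyze_1d(x):
    """Apply the Haar low-/high-pass pair and subsample by two."""
    low = [S * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [S * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high


def transpose(m):
    return [list(col) for col in zip(*m)]


def analyze_2d(image):
    """Return the LL, LH, HL, HH subbands of a square, even-sized image."""
    row_low, row_high = [], []
    for row in image:                       # filter along each row
        lo, hi = analyze_1d(row)
        row_low.append(lo)
        row_high.append(hi)

    def filter_columns(m):                  # filter along each column
        lo_cols, hi_cols = zip(*(analyze_1d(col) for col in transpose(m)))
        return transpose(lo_cols), transpose(hi_cols)

    ll, lh = filter_columns(row_low)
    hl, hh = filter_columns(row_high)
    return ll, lh, hl, hh


def energy(m):
    return sum(v * v for row in m for v in row)


image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
ll, lh, hl, hh = analyze_2d(image)
# For an orthonormal filter pair the subbands preserve the image energy.
print(energy(image), energy(ll) + energy(lh) + energy(hl) + energy(hh))
```

For a deeper transform, `analyze_2d` would be reapplied to the returned LL subband only, with the other three subbands retained at each stage.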

It turns out that this iterative filtering and decimating process corresponds to data decomposition
using a set of images that form a basis for all square-summable images. Let these basis images
be denoted by $v(j,k,l)$, $j = 0, 1, \ldots, J$, $k = 0, 1, \ldots, k(j)$, and $l = 0, 1, \ldots, 4^{n_0-j} - 1$, where
$N = 2^{n_0}$ is a dyadic number. Then the image data, denoted by $x$, can be represented by

$$x = \sum_{j,k,l} c_{j,k,l}\, v(j,k,l), \qquad (2)$$

where $\{c_{j,k,l}\}$ is a set of coefficients. The variable index maximum $k(j)$ is equal to 0 for $j = 0$, to 3
for $j = J$, and to 2 otherwise.

Throughout the remainder of this report, the terms basis image and basis vector are used interchangeably.
This usage emphasizes the strong connections to vector-space ideas and underscores
that most of the discussion is applicable to D-dimensional data sets for D > 2.

Wavelet  Packets. 

The  wavelet  packet  decomposition  of  an  image  is  closely  related  to  the  wavelet  transform.  Instead 
of  iteratively  applying  the  QMFs  to  the  LL  data  only,  they  are  iteratively  applied  to  each  of  the  four 
data  sets  LL,  LH,  HL,  and  HH.  This  results  in  a  set  of  vectors  that  contains  many  distinct  bases, 
including the basis corresponding to the wavelet transform. Let this large set of linearly dependent
vectors be denoted by $w(j,k,l)$, $j = 0, 1, \ldots, J$, $k = 0, 1, \ldots, 4^j - 1$, $l = 0, 1, \ldots, 4^{n_0-j} - 1$,
and let the operator $W_{j,k,l}$ denote the transformation of the image data $x$ to the coefficient that
corresponds to the basis image $w(j,k,l)$,

$$c_{j,k,l} = W_{j,k,l}[x].$$

Then the image data is represented by

$$x = \sum_{j,k,l} c_{j,k,l}\, w(j,k,l) = \sum_{j,k,l} \left(W_{j,k,l}[x]\right) w(j,k,l).$$

The image subspace spanned by the vectors $w(j,k,\cdot)$ is denoted by $B(j,k)$. Each node in the
tree of Figure 17 is associated with a single subspace $B(j,k)$. An orthogonal basis is an orthogonal
collection of the $B(j,k)$ that spans the image space.

The  wavelet  packet  decomposition  is  illustrated  in  Figure  17.  Note  that  the  wavelet  transform 
consists  of  all  extreme-left  nodes  and  their  immediate  children. 

An  Alternate  Subspace-Indexing  Scheme. 

In the previously described indexing scheme for $w(j,k,l)$, the variable $j$ denotes the depth in the
decomposition tree (Figure 17), $k$ denotes the node number at depth $j$ (starting with $k = 0$ on
the left), and $l$ denotes the particular element in the matrix associated with node $k$ at depth $j$. For
example, the filled node in Figure 17 corresponds to $j = 2$, $k = 14$, and it is associated with a
$2^{n_0-2}$ by $2^{n_0-2}$ matrix, whose elements are indexed by $l$ (starting with $l = 0$) after creating a vector
by concatenating its rows.

In computer programs that implement decomposition trees like that in Figure 17 (see WaveLab
[6]), it can be more convenient to use a different indexing scheme. In this alternate scheme, the
basis images are also indexed by an ordered triplet $(j,k,l)$, where $j$ denotes the absolute node
number, and the pair $(k,l)$ denotes the matrix element associated with node $j$. For example, the
filled node in Figure 17 corresponds to $j = 20$, and it is associated with a $2^{n_0-2}$ by $2^{n_0-2}$ matrix,
so that $k = 1, \ldots, 2^{n_0-2}$ and $l = 1, \ldots, 2^{n_0-2}$.

We refer to the first numbering system as Saito’s numbering system (SNS) [  ] and to the alternate
as the wavetree numbering system (WNS), after MRC’s MATLAB data structure wavetree.
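
The relationship between the two node-numbering conventions can be sketched in a few lines. The mapping below is an assumption inferred from the single worked example in the text (depth 2, node 14 corresponds to absolute node 20): it presumes that absolute node numbers are one-based and assigned level by level, left to right. The function names are hypothetical.

```python
# Sketch of the node-number conversion between the two indexing schemes.
# Assumption: absolute node numbers are one-based, assigned level by level
# from left to right in the quadtree of Figure 17.

def sns_to_wns_node(j, k):
    """Map (depth j, node k at that depth) to an absolute node number."""
    if not 0 <= k < 4 ** j:
        raise ValueError("node index k must satisfy 0 <= k < 4**j")
    nodes_above = (4 ** j - 1) // 3          # 1 + 4 + ... + 4**(j - 1)
    return nodes_above + k + 1


def wns_to_sns_node(node):
    """Inverse map: absolute node number -> (depth j, node k)."""
    j = 0
    while (4 ** (j + 1) - 1) // 3 < node:    # skip whole levels above the node
        j += 1
    return j, node - (4 ** j - 1) // 3 - 1


print(sns_to_wns_node(2, 14))   # 20, matching the filled node of Figure 17
print(wns_to_sns_node(20))      # (2, 14)
```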


Figure  16:  Wavelet  transform  for  J  =  1,  which  is  also  the  wavelet  packet  for  J  =  1.  The 
Daubechies  wavelet  with  parameter  20  is  used. 


Analogy  to  the  Two-Dimensional  Fourier  Transform. 

The representation of the image $x$ as a weighted sum of basis images might be better understood by
analogy with the simpler and more familiar two-dimensional discrete Fourier transform (2D-FT).
The 2D-FT for the image $x = \{x(u,v)\}$ is

$$y(f_x, f_y) = \sum_{u=1}^{N} \sum_{v=1}^{N} x(u,v)\, e^{-i 2\pi (f_x u + f_y v)/N},$$

and its inverse is simply

$$x(u,v) = \frac{1}{N^2} \sum_{f_x=1}^{N} \sum_{f_y=1}^{N} y(f_x, f_y)\, e^{i 2\pi (f_x u + f_y v)/N}.$$

A single element from the inverse-transform sum is

$$y(r,s)\, e^{i 2\pi (r u + s v)/N},$$

which is a scaled image in $u$ and $v$. The image is trivially obtained by inverse transforming the
2D-FT that is zero everywhere except for $f_x = r$, $f_y = s$.

The wavelet transform, or any other transform based on the wavelet packet, works in a similar
way, although the transform is more complex. However, we can obtain the basis images by inverse
transforming a set of coefficients that are all zero except in the desired location $(j,k,l)$.

Wavelet-Packet  Processing. 

The  goal  in  wavelet-packet  compression  is  to  search  over  all  possible  bases  corresponding  to  a 
wavelet  packet  for  the  one  that  possesses  maximum  energy  compaction  with  respect  to  a  target 
class  of  interest.  This  simply  means  that  the  energy  of  the  decomposed  data  is  concentrated  in  the 
fewest possible basis coefficients. If the number of significant basis coefficients is small compared
to $N^2$, then by representing the data as the values of these few coefficients, a large degree of
compression is obtained at a small loss in fidelity.

The  goal  in  wavelet-packet  classification  is  to  search  over  all  possible  bases  for  the  one  that 
possesses  the  basis  vectors  that  have  the  maximum  discrimination  power  over  all  input  target 
classes  of  interest.  The  discrimination  power  is  quantified  by  a  distance  measure.  Such  a  basis  is 
referred to as a local discriminant basis (LDB) [  ]. The ideal LDB is one for which a very small
number (compared to $N^2$) of basis coefficients can be used to reliably determine the class to which
an  input  data  set  belongs.  LDBs,  and  how  to  obtain  them,  are  the  focus  of  the  next  section. 


B  Local  Discriminant  Bases 

The material in this appendix is excerpted from [2].

LDBs are obtained only in the context of a specific set of target classes of interest. Suppose we
have $C$ classes of interest, and we have $N_c$ training images for class $c$, $c = 1, \ldots, C$, provided in
sets $X_c$. It is assumed that the training images are representative of their respective classes. The
fundamental idea behind the LDB is to find a basis such that there are a few basis vectors whose
coefficients vary widely among the classes while varying little between members of a class. Before
we can determine whether a vector can provide good discrimination between classes, we need to
know something about the behavior of the vector’s coefficients within each class. Specifically, we
need to establish a measure of the average strength of the coefficients throughout the packet for
each class. The strength measure could be average value, average absolute value, average energy,
and others [  ]. Then we need to establish a measure of the distance or difference between the
basis coefficients for two or more classes; this distance measures the discrimination power of the
corresponding basis image.

Figure 17: Illustration of the wavelet packet decomposition.

Coarse  LDB  Algorithm. 

To  find  the  LDB,  and  the  best  vectors  in  the  LDB,  we  perform  the  steps  shown  in  Figure  18,  which 
we  shall  expand  upon  in  the  remainder  of  this  section. 


1.  Obtain  the  training  data  sets. 

2.  Select  a  wavelet  by  choosing  a  QMF  pair. 

3.  For  each  target  class,  find  the  energy  in  its  average  wavelet 
packet  decomposition  using  the  training  data. 

4.  Search  over  the  bases  in  the  packet  for  the  one  with  the  largest 
differences  in  average  interclass  energy. 

5.  For the selected basis, order the basis vectors by their discrimination
power.


Figure 18: A coarse statement of the algorithm for finding the local discriminant basis (LDB).

B.1 Constructing the Average Packet-Energy Map

The goal of this step in the LDB algorithm is to characterize the average energy contained in the
subspaces $B(j,k)$ for each target class. This gives us an idea of where the energy is concentrated
for the various classes, which can be used to select subspaces that possess greatly varying average
energies over the input classes, indicating good discrimination power.

The baseline average packet-energy map [  ] for class $c$ is defined by $\Gamma_c(j,k,l)$:

$$\Gamma_c(j,k,l) = \frac{1}{E_c} \sum_{i:\, x_i \in X_c} \left(W_{j,k,l}[x_i]\right)^2, \qquad (3)$$

where the class energy $E_c$ is defined by

$$E_c = \sum_{i:\, x_i \in X_c} \|x_i\|^2.$$

The energy map $\Gamma_c$ has the advantages of conceptual and mathematical simplicity, and it does
indicate the subspaces that are particularly energetic, but it has drawbacks for LDB-based ATR.
In particular, it cannot distinguish between subspaces that have similar energy but coefficients with
opposing signs; such pairs of subspaces may in fact be very useful for classification. To avoid this
drawback, an alternate energy measure is simply the average value map:

$$\alpha_c(j,k,l) = \frac{1}{N_c} \sum_{i:\, x_i \in X_c} W_{j,k,l}[x_i]. \qquad (4)$$

This map can be especially useful in conjunction with the variance map:

$$V_c(j,k,l) = \left[ \frac{1}{N_c} \sum_{i:\, x_i \in X_c} \left(W_{j,k,l}[x_i]\right)^2 \right] - \alpha_c(j,k,l)^2. \qquad (5)$$

Because there are competing energy-measuring functions, the generic energy-map function is
denoted by $E_c(j,k,l)$. The energy for a subspace $B(j,k)$ is simply the sum of energies of its
components.
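
The three candidate maps can be computed side by side, as in the following minimal sketch. It assumes that the wavelet-packet coefficients of each training image are already available as a flat list (one entry per $(j,k,l)$ index) and that the class energy is the total energy of the class's training images; the function name `class_maps` and the toy data are hypothetical.

```python
# Sketch of the normalized energy map, average-value map, and variance map
# for one class. Assumptions: coefficients are precomputed, and the class
# energy is the summed squared norm of the class's training images.

def class_maps(coeffs, image_energies):
    """coeffs[i][m]: coefficient m (a flattened (j, k, l) index) of image i.
    image_energies[i]: squared norm of training image i.
    Returns (gamma, alpha, var), each a list over the coefficient index m.
    """
    n_c = len(coeffs)                        # number of training images
    e_c = sum(image_energies)                # class energy
    n_coef = len(coeffs[0])
    mean_sq = [sum(c[m] ** 2 for c in coeffs) / n_c for m in range(n_coef)]
    gamma = [n_c * ms / e_c for ms in mean_sq]
    alpha = [sum(c[m] for c in coeffs) / n_c for m in range(n_coef)]
    var = [ms - a ** 2 for ms, a in zip(mean_sq, alpha)]
    return gamma, alpha, var


# Two toy classes whose first coefficient differs only in sign.
class_a = [[1.0, 2.0], [1.0, 2.0]]
class_b = [[-1.0, 2.0], [-1.0, 2.0]]
gamma_a, alpha_a, var_a = class_maps(class_a, [5.0, 5.0])
gamma_b, alpha_b, var_b = class_maps(class_b, [5.0, 5.0])
print(gamma_a, gamma_b)   # identical energy maps
print(alpha_a, alpha_b)   # average-value maps differ in the first entry
```

In this toy case the energy maps of the two classes coincide, while the average-value maps separate them, which is the drawback of the baseline map described above.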


B.2  Searching  for  the  Best  Discriminant  Basis 

The goal of this processing step is to use the energy maps to determine the LDB. The idea, as
previously stated, is to find the nodes in the packet tree (or, equivalently, the subspaces $B(j,k)$)
that possess a distinctly different average energy for each class. To find these nodes, we need
to define a measure of distance between the subspace energy functions $\{E_c(j,k,\cdot)\}_{c=1}^{C}$. To be
general, we base the distance measure $D(\cdot)$ on a pairwise distance measure $D_p(\cdot,\cdot)$:

$$D(\{y_i\}_{i=1}^{M}) = \sum_{j=1}^{M} \sum_{k=j+1}^{M} D_p(y_j, y_k). \qquad (6)$$

The pairwise distance measure can be Euclidean distance, relative entropy, or others (see [  ], page
66).

The baseline subspace energy distance function [  ] can now be defined:

$$D(\{E_c(j,k,\cdot)\}_{c=1}^{C}) = D(\{E_c(j,k)\}_{c=1}^{C}) = \sum_{l=0}^{4^{n_0-j}-1} D\big(E_1(j,k,l), E_2(j,k,l), \ldots, E_C(j,k,l)\big).$$

This is the sum over all unique pairs of the pairwise distances for elements of the subspace $B(j,k)$.
Note that if the energy in the vectors for each class is distinct from the energy in the other classes,
the sum of pairwise differences will be large.
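
A small numerical sketch may help fix the two levels of summation: over unique class pairs for one element, and over the elements of a subspace. The example assumes scalar energy-map values and uses the absolute difference as the pairwise measure; the function names and numbers are hypothetical.

```python
# Sketch of the pairwise-sum distance and the subspace distance.
# Assumption: scalar map values, absolute difference as the pairwise measure.
from itertools import combinations


def abs_difference(a, b):
    return abs(a - b)


def total_distance(values, pairwise=abs_difference):
    """Sum the pairwise measure over all unique pairs of class values."""
    return sum(pairwise(a, b) for a, b in combinations(values, 2))


def subspace_distance(maps, pairwise=abs_difference):
    """maps[c][l]: energy-map value of class c for element l of one subspace."""
    n_elem = len(maps[0])
    return sum(total_distance([m[l] for m in maps], pairwise)
               for l in range(n_elem))


# Three classes, two elements; only the first element separates the classes.
maps = [[0.1, 0.2], [0.5, 0.2], [0.9, 0.2]]
print(total_distance([0.1, 0.5, 0.9]))
print(subspace_distance(maps))
```

Any other two-argument pairwise measure can be passed through the `pairwise` parameter without changing the summation structure.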

Denote the LDB by the collection of subspaces $A(j,k)$ and denote the children of subspace
$B(j,k)$ by $\{B(j+1,i) : i \in I(j,k)\}$. The algorithm for obtaining the LDB is stated in Figure
19.  Step  6  is  the  key  step.  The  idea  is  to  compare  the  discrimination  power  of  a  subspace  to  the 
sum  of  the  discrimination  powers  of  all  its  children.  If  the  children  are  better  discriminators,  then 
they  determine  the  composition  of  the  best-basis  subspace,  but  if  the  subspace  itself  is  a  better 
discriminator,  then  it  is  retained  as  the  best-basis  subspace. 


1.  Choose the desired QMFs for the wavelet packet of interest.

2.  Specify the maximum decomposition depth $J$.

3.  Specify the pairwise distance measure $D_p(\cdot,\cdot)$ and the subspace
energy measure $E_c(j,k,l)$.

4.  Compute the energy maps $E_c$ for $c = 1, 2, \ldots, C$.

5.  Set $A(J,k) = B(J,k)$ and $\Delta(J,k) = D(\{E_c(J,k)\}_{c=1}^{C})$ for $k = 0, 1, \ldots, 4^J - 1$.
This initializes the LDB to the set of subspaces
at the lowest level of the decomposition (viewed in terms of a
tree).

6.  Find the best subspace $A(j,k)$ for $j = J-1, \ldots, 0$ and $k = 0, 1, \ldots, 4^j - 1$
by using the following rule:

    Set $\Delta(j,k) = D(\{E_c(j,k)\}_{c=1}^{C})$.

    If $\Delta(j,k) > \sum_{n \in I(j,k)} \Delta(j+1,n)$,
    then set $A(j,k) = B(j,k)$;
    else set $A(j,k) = \bigcup_{n \in I(j,k)} A(j+1,n)$ and set
    $\Delta(j,k) = \sum_{n \in I(j,k)} \Delta(j+1,n)$.

7.  The LDB is $A(0,0)$.


Figure 19: A detailed statement of the algorithm used to find the local discriminant basis (LDB).
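
Steps 5 through 7 of Figure 19 amount to a bottom-up pruning of the packet quadtree, which can be sketched as follows. The sketch assumes that the discriminant measure of every subspace has already been computed and that the children of node $(j,k)$ are the nodes $(j+1, 4k), \ldots, (j+1, 4k+3)$; that child-index convention, the function name `find_ldb`, and the numerical values are all hypothetical.

```python
# Sketch of the bottom-up best-basis search (steps 5-7 of Figure 19).
# Assumption: the children of node (j, k) are (j + 1, 4k), ..., (j + 1, 4k + 3).

def find_ldb(own, depth):
    """own[(j, k)]: discriminant measure of subspace B(j, k) taken alone.
    Returns (list of selected (j, k) nodes, discriminant measure of the basis).
    """
    basis, delta = {}, {}
    for k in range(4 ** depth):                       # step 5: deepest level
        basis[(depth, k)] = [(depth, k)]
        delta[(depth, k)] = own[(depth, k)]
    for j in range(depth - 1, -1, -1):                # step 6: move up the tree
        for k in range(4 ** j):
            children = [(j + 1, 4 * k + i) for i in range(4)]
            child_sum = sum(delta[c] for c in children)
            if own[(j, k)] > child_sum:               # parent discriminates better
                basis[(j, k)] = [(j, k)]
                delta[(j, k)] = own[(j, k)]
            else:                                     # keep the children's choice
                basis[(j, k)] = [n for c in children for n in basis[c]]
                delta[(j, k)] = child_sum
    return basis[(0, 0)], delta[(0, 0)]               # step 7


own = {(0, 0): 1.0}
own.update({(1, k): v for k, v in enumerate([3.0, 0.5, 0.5, 0.5])})
own.update({(2, k): v for k, v in enumerate([0.5] * 4 + [0.25] * 4 + [0.1] * 8)})
ldb, score = find_ldb(own, depth=2)
print(ldb, score)
```

In this toy example the search keeps three depth-1 subspaces whole and replaces the fourth by its four children, because only for that node do the children jointly discriminate better than the parent.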
B.3 Ordering Basis Vectors by Discriminant Power

The final step in obtaining the LDB is to order the individual basis vectors in terms of their discrimination
power. A natural choice for the measure of discrimination power is the distance measure
used in obtaining the best basis. When obtaining the best basis, the measure was applied to subspaces
$B(j,k)$. Here, the measure is applied to the individual vectors in the subspaces making up
the best basis. The vectors are then sorted in decreasing order with respect to their distance.

An alternative to using the best-basis distance measure to order the basis vectors involves using
the average-value energy map $\alpha_c(j,k,l)$ together with the variance map $V_c(j,k,l)$. The primary
difficulty with using the baseline average packet-energy map $\Gamma_c$ together with the distance measure $D_p$ is that a
vector that possesses a high average energy for a target class can also have a high variance for the
images within the class, which can substantially limit its applicability for classification. The ideal
vector has the following properties:

1.  Maximum  mean  values  over  the  C  target  classes, 

2.  Maximally  distinct  mean  values  over  the  C  classes, 

3.  Small  variance  for  each  class. 

Motivated  by  these  considerations,  we  can  define  a  distance  between  two  elements  of  the  best 
basis  that  is  particularly  suitable  for  classification  purposes.  For  classes  p  and  q,  define 


$$m_i = \alpha_i(j,k,l),$$

$$s_i = V_i(j,k,l),$$


for $i = p, q$. Then the pairwise distance is defined by


$m_q > m_p$
$m_q < m_p$.


The distance between more than two classes can be the sum of the pairwise distances,

$$D(w(j,k,l)) = \sum_{p=1}^{M} \sum_{q=p+1}^{M} D_v(w(j,k,l), p, q),$$

or the minimum over all pairs

$$D(w(j,k,l)) = \min_{p \neq q} D_v(w(j,k,l), p, q).$$
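
The difference between the two aggregation rules (sum over pairs versus minimum over pairs) can be seen in a short ranking sketch. Because the pairwise distance is not reproduced here, the function `d_v` below is a purely hypothetical stand-in (mean separation divided by the pooled spread), chosen only to reflect the listed properties; it should not be read as the report's definition. The vector names and statistics are also made up.

```python
# Sketch of ordering basis vectors by an aggregated pairwise distance.
# Assumption: d_v is a hypothetical stand-in, NOT the report's definition.
from itertools import combinations


def d_v(m_p, s_p, m_q, s_q):
    """Hypothetical pairwise distance: mean separation relative to spread."""
    return abs(m_p - m_q) / (1e-12 + (s_p + s_q) ** 0.5)


def rank_vectors(stats, combine=sum):
    """stats[name]: list of (mean, variance) per class for one basis vector.
    combine: sum (total over pairs) or min (worst-separated pair).
    Returns the vector names in decreasing order of aggregated distance.
    """
    def score(per_class):
        return combine(d_v(mp, sp, mq, sq)
                       for (mp, sp), (mq, sq) in combinations(per_class, 2))
    return sorted(stats, key=lambda name: score(stats[name]), reverse=True)


stats = {
    "a": [(0.0, 1.0), (1.0, 1.0), (5.0, 1.0)],   # two classes nearly overlap
    "b": [(0.0, 1.0), (2.0, 1.0), (4.0, 1.0)],   # evenly separated classes
}
print(rank_vectors(stats, combine=sum))
print(rank_vectors(stats, combine=min))
```

With these numbers the sum rule prefers vector "a" while the minimum rule prefers "b", since the minimum penalizes any pair of classes that remains poorly separated.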
Abstract

Recently  developed  binary-tree-based  classifiers  are  studied  through  simulation  experiments.  The 
two  distinct  classifiers  are  general  and  are  applicable  to  a  wide  variety  of  classification  problems. 
The  classifiers  accept  one-  and  two-dimensional  inputs  and  can  be  easily  generalized  to  higher 
dimensions.  The  core  classifier  is  the  binary  tree  classifier  (BTC).  This  classifier  employs  the 
local  discriminant  basis  (LDB)  to  automatically  and  jointly  determine  the  best  tree  topology  and 
feature-vector  values  for  each  decision-node  in  the  tree.  The  binary  hypertree  classifier  (BHC) 
combines  several  BTCs  for  the  special  situation  in  which  multiple  input  data  types  are  available 
for  each  class.  For  example,  the  objects  may  be  viewed  with  one  of  several  distinct  camera  types. 
A  BTC  is  created  for  each  camera  type  and  the  BHC  combines  these  in  an  efficient  manner  so 
that  performance  is  maximized  while  using  a  minimum  of  input  data  types.  The  performance  of 
these and related classifiers is evaluated in this report via simulation experiments for one- and two-dimensional
inputs. It is shown that the classifiers are very good at automatically determining the
structure of a given problem and extracting the most useful feature subsets for inclusion in the tree
structures.

1 Introduction

The  ability  to  automatically  determine  the  type  of  a  remote  target,  such  as  a  tank,  bus,  or  armored 
vehicle,  is  crucial  to  successful  military  operations  because  it  helps  sort  out  friends,  foes,  and 
neutrals  in  the  chain  of  weapons-targeting  operations.  The  subject  of  this  report  is  a  new  class 
of systems for automatic target recognition (ATR) that is developed mathematically in a companion
report [  ]. The new systems employ the local discriminant basis (LDB) [  ] and binary-tree
classifiers in an attempt to automatically identify and characterize the highly discriminatory data
elements  and,  simultaneously,  minimize  the  required  number  of  sensor  modalities  for  accurate 
classification.  In  other  words,  the  systems  attempt  to  reap  the  performance  gains  of  a  full-blown 
sensor-fusion  approach  by  selectively  requesting  new  data  or  modalities  only  when  required  to 
eliminate  a  class  ambiguity. 

This report presents a set of experimental results aimed at proof-of-concept for the binary tree
classifiers presented in [  ]. We aim to answer the following questions. Can the developed structures
automatically determine the low-dimensional data subspace that provides optimal classification
performance? Can the developed hypertree structures minimize or substantially reduce the
amount of data needed for a given performance level when compared to a classifier that fuses all
available sensor data? To provide the proof-of-concept and answers to these questions, we apply
the developed machinery to several problems involving simulated one- and two-dimensional data
sets, as well as to publicly available data sets used in classification-system research and development.

The  remainder  of  this  report  is  organized  as  follows.  The  binary-tree  classification  structures 
are  reviewed  in  Section  2,  and  the  methods  of  training  the  classifiers  are  described  in  Section  3. 
Classifier  operation  (a  relatively  simple  task  compared  to  training)  is  briefly  described  in  Section 
4  (see  also  [1]).  The  various  experiments  and  results  are  described  in  Section  5,  and  concluding 
remarks  are  provided  in  Section  6. 


2 Binary Trees for Classification

In this section, we review the three tree-based classifier structures developed in [  ]: the binary
tree classifier (BTC), the binary supertree classifier (BSC), and the binary hypertree classifier (BHC).
The BTC is a classifier that uses a single sensor modality (called here a classifier input data (CID)
type) and, usually, one wavelet type. Its topology and the classes involved at the decision nodes
can be fixed in advance of training or chosen jointly during training. The BSC is a BTC that uses
all available CID types (concatenated together) as its input; it represents a fusion algorithm. The
BHC links together an arbitrary number of BTCs or BSCs to form a structure that jumps from CID
to CID as needed or, more generally, from BTC to BTC as needed. The performance of the BHC
is, in some cases, equal to that of the BSC, but the average data requirement should be smaller.

2.1 The Binary Tree Classifier (BTC)

The binary tree classifier has the structure illustrated by the example in Figure 1. The tree can be
balanced or unbalanced; the particular arrangement of decision and terminal nodes is called the tree
topology. For each decision node in the BTC, there are several quantities that need specification:

1.  Measurement Vector w. This vector has length K and specifies the K wavelet operations
used to make the binary decision at the node.

2.  Left Superclass Set L and Right Superclass Set R. Each decision node makes a single binary
decision between the superclasses on the left and right. The union of L and R for any
decision node is equal to the set of all classes inherited from the node’s parent. For example,
referring to Figure 1, node 1 (the root node) splits the class-label set {1, 2, 3, 4, 5, 6, 7, 8} into
nonempty subsets $L_1 = \{1, 2, 3, 4, 5\}$ and $R_1 = \{6, 7, 8\}$.

3.  Left  Average  Feature  Vector  v.  This  vector  has  length  K  and  is  the  average  value  of  the  K 
measurements  specified  by  w  using  the  classes  contained  in  superclass  set  L. 

4.  Right  Average  Feature  Vector  u.  This  vector  has  length  K  and  is  the  average  value  of  the  K 
measurements  specified  by  w  using  the  classes  contained  in  superclass  set  R. 

There are many possible tree topologies, and for each of these, there are many choices for superclass
selection. The idea is to jointly select the topology, superclasses, and measurement vectors
such  that  the  tree  has  optimal  performance.  The  code  developed  during  this  work  is  general  enough 
to  accommodate  a  fixed  topology,  a  fixed  topology  with  fixed  superclasses,  or  a  free  topology  and 
free  superclass  selection. 
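
The four per-node quantities listed above map naturally onto a small record type. The sketch below is one possible representation, under two loudly stated assumptions: the measurement vector w is modeled as a list of indices into a precomputed coefficient list, and the binary decision is taken by choosing the nearer of the two average feature vectors (the decision rule is not specified in this section). The class and method names are hypothetical.

```python
# Sketch of a BTC decision node.
# Assumptions: w holds indices into a precomputed coefficient list, and the
# decision rule is nearest average feature vector (squared Euclidean distance).
from dataclasses import dataclass


@dataclass
class DecisionNode:
    w: list            # the K selected measurements (coefficient indices)
    left: frozenset    # left superclass set L
    right: frozenset   # right superclass set R
    v: list            # left average feature vector (length K)
    u: list            # right average feature vector (length K)

    def decide(self, coefficients):
        """Return "left" or "right" for one input's coefficient list."""
        x = [coefficients[m] for m in self.w]
        d_left = sum((a - b) ** 2 for a, b in zip(x, self.v))
        d_right = sum((a - b) ** 2 for a, b in zip(x, self.u))
        return "left" if d_left <= d_right else "right"


root = DecisionNode(w=[0, 3],
                    left=frozenset({1, 2, 3, 4, 5}),
                    right=frozenset({6, 7, 8}),
                    v=[1.0, 0.0],
                    u=[0.0, 1.0])
print(root.decide([0.9, 5.0, 5.0, 0.2]))   # measurements (0.9, 0.2) -> "left"
```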

2.2 The Binary Supertree Classifier (BSC)

The  binary  supertree  classifier  is  simply  a  BTC  with  a  multimodal  classifier  input  data  (CID)  type. 
The  data  for  each  CID  is  concatenated  to  form  the  “super”  input  to  the  BSC. 

2.3 The Binary Hypertree Classifier (BHC)

The  binary  hypertree  classifier  connects  two  or  more  BTCs  that  are  aimed  at  solving  the  same 
classification  problem.  A  graphical  depiction  of  a  BHC  is  shown  in  Figure  2.  This  particular 
BHC  comprises  four  BTCs,  each  associated  with  a  unique  CID.  The  key  idea  is  that  each  decision 
node  in  each  BTC  has  an  ambiguity  associated  with  it  such  that  for  low  ambiguity  (near  zero), 
the  decision  made  at  the  node  is  almost  always  correct  (low  probability  of  error),  and  for  high 
ambiguity  (near  1),  the  decision  is  wrong  with  probability  near  one-half.  Some  BTCs  may  have 
several  decision  nodes  with  very  low  ambiguity  and  several  with  high  ambiguity.  The  BHC  points 
each  node  of  each  BTC  to  the  constituent  BTC  having  the  node  with  lowest  ambiguity  and  same 
union  of  L  and  R. 
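
The pointer rule in the last sentence can be sketched as a single pass over all decision nodes. The fragment below assumes that each node is described by its tree index, node index, the class set it splits (the union of L and R), and a precomputed ambiguity value; the function name `build_pointers`, the class sets, and the ambiguity values are hypothetical.

```python
# Sketch of hypertree pointer construction: every decision node points to the
# least ambiguous node, in any constituent BTC, that splits the same class set.
# All numerical values and class sets below are made up for illustration.

def build_pointers(nodes):
    """nodes: list of (tree, node, classes, ambiguity) for all decision nodes.
    Returns {(tree, node): (best_tree, best_node)}.
    """
    best = {}
    for tree, node, classes, ambiguity in nodes:
        key = frozenset(classes)
        if key not in best or ambiguity < best[key][0]:
            best[key] = (ambiguity, (tree, node))
    return {(tree, node): best[frozenset(classes)][1]
            for tree, node, classes, _ in nodes}


all_classes = {1, 2, 3, 4, 5, 6, 7, 8}
nodes = [
    (1, 1, all_classes, 0.0),
    (1, 2, {1, 2, 3, 4}, 0.9),
    (3, 1, all_classes, 0.3),
    (3, 2, {1, 2, 3, 4}, 0.1),
]
pointers = build_pointers(nodes)
print(pointers[(1, 2)])   # (3, 2): the ambiguous node defers to tree 3
```

A node whose own ambiguity is the lowest for its class set points to itself, so no new sensor data is requested when it is reached.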
Figure 1: A typical BTC for an eight-class problem.

Figure 2: Illustration of the hypertree idea.

3 Classifier Parameter Specification and Training

In  this  section  we  review  the  manner  in  which  BTCs  and  BHCs  are  specified  starting  from  a  training 
data  set  and  ending  with  an  operational  classifier. 

3.1  Tree  Topology 

The  tree  topology  is  the  set  of  connections  between  the  tree  nodes.  The  tree  can  be  balanced,  as  in 
Figure  3,  or  unbalanced,  as  in  Figure  1.  The  best  topology  for  a  particular  classification  problem 
is  dependent  on  the  statistical  nature  of  the  problem.  In  general,  it  is  advantageous  to  determine 
the  topology  in  some  adaptive  manner  that  employs  measurements  on  the  training  data.  Analysis 
of  balanced  and  unbalanced  BTCs  is  reported  in  [1]. 

The tree topologies are specified using an integer-valued vector, whose structure is explained here. Each successive pair of integers in the vector specifies the sizes of the left and right superclass sets for a node in the tree. The convention used to create the vector is to start at the root node, record the two integers specifying its superclass sizes, and take the left path out of that node to find the next node. Record the two integers specifying that node’s superclass sizes and continue taking the left path out of each node until a terminal node is reached, then travel back up the tree, taking the first available right-going path that is found. This is repeated until all decision nodes are specified in the vector. The balanced tree of Figure 3, for example, is specified by

T = [4 4 2 2 1 1 1 1 2 2 1 1 1 1],

and the unbalanced tree of Figure 1 is specified by

T = [5 3 1 4 1 3 2 1 1 1 1 2 1 1].
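As a minimal sketch of this convention (not the software developed under the contract), the following Python function rebuilds a nested tree from such a vector and checks that every pair actually splits the classes it inherits. The names `parse_topology` and `depth` and the tuple representation are illustrative assumptions; a superclass of size one is treated as a terminal node that contributes no pair.

```python
from typing import List, Tuple, Union

# A subtree is either a terminal node (None) or a decision node
# (left_size, right_size, left_subtree, right_subtree).
Tree = Union[None, Tuple[int, int, "Tree", "Tree"]]


def parse_topology(vector: List[int]) -> Tree:
    """Rebuild a binary tree from a topology vector read in left-first order."""
    if len(vector) % 2:
        raise ValueError("topology vector must contain pairs of integers")
    pairs = list(zip(vector[0::2], vector[1::2]))
    pos = 0

    def build(size: int) -> Tree:
        nonlocal pos
        if size == 1:  # a single class: terminal node, nothing recorded
            return None
        if pos >= len(pairs):
            raise ValueError("vector ends before all decision nodes are specified")
        left, right = pairs[pos]
        pos += 1
        if left < 1 or right < 1 or left + right != size:
            raise ValueError(f"pair {(left, right)} does not split {size} classes")
        # Left path first; the right path is taken on the way back up.
        return (left, right, build(left), build(right))

    tree = build(sum(pairs[0])) if pairs else None
    if pos != len(pairs):
        raise ValueError("vector has unused entries")
    return tree


def depth(tree: Tree) -> int:
    """Number of decision nodes on the longest root-to-terminal path."""
    return 0 if tree is None else 1 + max(depth(tree[2]), depth(tree[3]))


balanced = parse_topology([4, 4, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1])
unbalanced = parse_topology([5, 3, 1, 4, 1, 3, 2, 1, 1, 1, 1, 2, 1, 1])
print(depth(balanced), depth(unbalanced))
```

Both example vectors parse without error under this reading; the balanced vector yields three decision levels and the unbalanced one five.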


3.2  Superclass  Selection 

For a fixed topology T, there are many choices for the sets of left and right superclasses, denoted by $\{L_i\}$ and $\{R_i\}$, respectively. The superclasses may be specified in advance, which would also
implicitly  define  the  topology.  This  may  be  reasonable  when  certain  class  subsets  are  obviously 
related  in  some  manner,  such  as  those  classes  whose  images  consist  of  arcs  and  those  that  consist 
of  straight  edges.  But  in  most  cases,  it  will  not  be  obvious  how  to  specify  the  best  topology  or 
superclass  sets,  and  so  we’ll  need  an  algorithm  that  finds  them  automatically  during  training. 

3.3 Feature-Vector Specification

If  the  topology  and  superclasses  are  specified,  then  there  are  two  remaining  classifier  parameters 
to  specify:  the  measurement  vectors  and  the  average  feature  vectors.  The  measurement  vector  is 
determined  by  using  the  local  discriminant  basis  (LDB)  [5]  at  each  decision  node  [  ,  ].  This 
algorithm  finds  the  best  wavelet  basis  (using  the  specified  mother  wavelet  type)  for  representing 
the  classes  for  the  purpose  of  discriminating  between  classes.  This  operation  is  analogous  to  the 
more  familiar  best-basis  algorithm  for  finding  the  best  wavelet  basis  for  representing  the  classes 
for  the  purpose  of  compression.  The  feature-vector  specification  algorithm  then  finds  the  best 
K vectors in the LDB. These K wavelet basis vectors are then used as the measurement vector. Finally, the algorithm uses the obtained measurement vector to find the average feature values for the two involved superclasses. During operation, the measurement vector is applied to the data to be classified, and the resultant K-vector is compared to the left and right average feature vectors.

3.4 Joint Tree Topology, Superclass Selection, and Feature-Vector Specification

The  key  algorithm  in  this  work  is  the  algorithm  for  joint  specification  of  the  topology,  superclasses, 
and feature vectors for a generic BTC. We have not proved that this algorithm produces the optimal classifier, but our experiments show that the algorithm is adept at automatically determining
appropriate  topologies,  superclasses,  and  features,  and  that  performance  can  be  quite  good.  The 
difficulty  with  finding  the  optimal  set  of  tree  parameters  is  computational.  For  problems  with  even 
a  modest  number  of  classes  C,  the  number  of  topologies  times  the  number  of  superclass  selections 
is  very  large.  This  is  further  compounded  when  there  is  more  than  one  sensor  modality,  so  that  the 
number  of  CIDs  M  is  greater  than  one. 

The  joint  tree-specification  algorithm  is  not  optimal  because  it  does  not  examine  all  possible 
combinations  of  tree  topology  and  superclass  selection.  Instead,  it  assumes  that  good  superclass 
selections  can  be  made  for  each  node  by  properly  modifying  a  previously  chosen  superclass  split. 
In  particular,  the  algorithm  first  finds  the  best  split  that  puts  one  class  in  L  and  the  remaining 
classes  in  R.  By  “best  split”  we  mean  the  choice  resulting  in  lowest  ambiguity.  Then  the  algorithm 
finds  the  best  choice  for  adding  one  of  the  elements  of  R  to  L,  and  so  on.  The  detailed  algorithm 
statement  is  shown  in  Figure  4. 

1. Obtain a set of training data for the C-class problem of interest.

2. Select a specific CID type.

3. Initialize a BTC structure:

(a) Specify the wavelet type (e.g., Haar, Daubechies, Symmlet).

(b) Specify the wavelet-packet tree depth J.

(c) Specify the feature-vector length K.

4. Start with the root node, node n = 1.

5. Denote by $P_n$ the inherited set of classes for node n. Let $P_n$ have size N.

6. Set $a_{\min} = 1.0$.

7. Let $L_0 = \{\}$, $R_0 = P_n$, $i = 0$.

8. Denote the size of $R_i$ as P. Let $L_n(j) = \{j, L_i\}$ and $R_n(j) = P_n - L_n(j)$.

9. Compute LDB for node n.

10. Select best K elements of LDB to form $w_n$.

11. Compute $u_n$ and $v_n$.

12. Compute the ambiguity $a_n(j)$.

13. Repeat Steps 8-12 until j = P.

14. Increment i, decrement P.

15. Retain the $L_n(j)$ set for which $a_n(j)$ is minimum. Denote best $L_n(j)$ by $L_i$.

16. If the minimum ambiguity $a_n(j)$ is greater than $a_{\min}$, then goto Step 18, else set $a_{\min} = a_n(j)$.

17. Repeat Steps 8-15 until $L_i$ has size greater than or equal to N/2.

18. Find childless decision node with specified parent node; denote by n if it exists and goto Step 5.

19. Stop.


Figure  4:  Automatic  specification  of  BTC  topology,  superclasses,  and  features. 
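
The inner loop of Figure 4 (Steps 6 through 17, for a single node) can be sketched in Python as follows. This is an illustration under stated assumptions, not the contract software: the `ambiguity` callable is a hypothetical stand-in for Steps 9 through 12 (LDB computation, selection of K elements, and evaluation of the average feature vectors), and it is assumed to return a value between 0 and 1. The toy ambiguity in the usage lines, based on the distance between one-dimensional class prototypes, is likewise invented for the demonstration.

```python
import math
from typing import Callable, FrozenSet, Iterable, Tuple

Ambiguity = Callable[[FrozenSet[int], FrozenSet[int]], float]


def greedy_split(
    classes: Iterable[int], ambiguity: Ambiguity
) -> Tuple[FrozenSet[int], FrozenSet[int], float]:
    """Grow the left superclass one class at a time, keeping the lowest-ambiguity choice."""
    inherited = frozenset(classes)
    if len(inherited) < 2:
        raise ValueError("a decision node needs at least two classes")
    left: FrozenSet[int] = frozenset()
    a_min = 1.0
    # Stop once the left set holds at least half of the inherited classes.
    while 2 * len(left) < len(inherited):
        right = inherited - left
        # Try moving each remaining class j from the right set to the left set.
        candidates = [(ambiguity(left | {j}, right - {j}), j) for j in sorted(right)]
        best_a, best_j = min(candidates)
        # Stop early if the best enlargement would raise the ambiguity.
        if left and best_a > a_min:
            break
        left = left | {best_j}
        a_min = best_a
    return left, inherited - left, a_min


# Toy usage: classes 1 and 2 look alike, as do classes 3 and 4.
prototypes = {1: 0.0, 2: 1.0, 3: 10.0, 4: 11.0}


def toy_ambiguity(left: FrozenSet[int], right: FrozenSet[int]) -> float:
    gap = min(abs(prototypes[a] - prototypes[b]) for a in left for b in right)
    return math.exp(-gap)


left, right, a_min = greedy_split(prototypes, toy_ambiguity)
print(sorted(left), sorted(right))
```

In the toy case the search first isolates one class, then finds that adding its look-alike to the left set sharply lowers the ambiguity, so the similar classes end up on the same side of the split.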

Figure 5: An eight-class problem used for illustration (panel title: Class Images using Grayscale Camera).


Example  of  Joint  Algorithm  in  Action. 

As  an  example,  consider  the  eight-class  problem  shown  in  Figure  5.  In  this  example,  which  is 
explored  in  more  detail  in  Section  5,  the  eight  classes  are  interpreted  as  objects  viewed  through  a 
gray-scale camera. This operation results in perfectly ambiguous class subsets, such as {1, 2, 6} and {7, 8}. The automatically obtained BTC for this CID is shown in Figure 6. Each decision node
is  annotated  with  its  computed  ambiguity  and  each  terminal  node  is  annotated  with  a  class  label. 
The  important  point  here  is  that  the  automatic  algorithm  perfectly  captures  the  inherent  structure 
of  the  problem:  7  will  be  mistaken  for  8  and  vice  versa,  and  1,  2,  and  6  will  be  confused.  This 
simply  reflects  the  strong  ambiguities  in  the  problem. 

3.5  Specifying  the  BHC 

A  BHC  is  constructed  from  two  or  more  constituent  BTCs.  The  BTCs  must  attack  either  the  same 
classification  problem  or  subsets  of  a  single  problem,  but  otherwise  can  have  different  topologies, 
CIDs,  superclass  selections,  wavelets,  feature  lengths  K,  etc.  Crucial  to  creation  of  a  BHC  is  the 
notion  of  corresponding  nodes.  For  convenience,  the  mathematical  definition  of  corresponding 
nodes  is  repeated  here  [  ]: 

Definition 1 (Corresponding Nodes) Let $T_1$ and $T_2$ denote two distinct BTCs and let $n_1$ and $n_2$ denote decision nodes from $T_1$ and $T_2$, respectively. If the unions of the left and right superclasses for nodes $n_1$ and $n_2$ match, then these two nodes are corresponding. If the superclasses are equal, then the nodes are equivalent. Note that the binary-tree parameters $C_1$ and $C_2$ for $T_1$ and $T_2$ need not be equal. ■

Let H denote a binary hypertree with N constituent binary tree classifiers, each with parameter C and each addressing the same classification problem. Assign the node pointers for each node such that the node points to the BTC containing the minimum-ambiguity corresponding node. These pointers will be used during BHC operation.

Figure 6: A BTC automatically obtained and plotted using the developed software (plot title: BTC 2. CID: Grayscale, Wavelet: Coiflet, Param: 5).
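
The pointer assignment can be sketched as below. This is a simplified illustration that ranks corresponding nodes by ambiguity alone; the `Node` record and the `assign_pointers` function are hypothetical names, and ties are broken in favour of the lower tree index, which is an arbitrary choice made here.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Node:
    left: FrozenSet[int]    # left superclass L
    right: FrozenSet[int]   # right superclass R
    ambiguity: float


def assign_pointers(trees: List[List[Node]]) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Map each (tree, node) index pair to its lowest-ambiguity corresponding node."""
    best: Dict[FrozenSet[int], Tuple[float, int, int]] = {}
    for t, nodes in enumerate(trees):
        for k, node in enumerate(nodes):
            union = node.left | node.right  # corresponding nodes share this union
            candidate = (node.ambiguity, t, k)
            if union not in best or candidate < best[union]:
                best[union] = candidate
    return {
        (t, k): best[node.left | node.right][1:]
        for t, nodes in enumerate(trees)
        for k, node in enumerate(nodes)
    }


fs = frozenset
tree0 = [Node(fs({1, 2}), fs({3, 4}), 0.5), Node(fs({1}), fs({2}), 0.1), Node(fs({3}), fs({4}), 0.6)]
tree1 = [Node(fs({1, 3}), fs({2, 4}), 0.2), Node(fs({1}), fs({3}), 0.3), Node(fs({2}), fs({4}), 0.3)]
pointers = assign_pointers([tree0, tree1])
print(pointers[(0, 0)], pointers[(0, 1)])
```

In this toy case the two root nodes correspond (both cover classes 1 to 4), so the root of the first tree points to the less ambiguous root of the second, while nodes with no counterpart in the other tree point to themselves.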


4 Classifier Operation

The  operation  of  the  three  classifier  types  is  described  in  this  section.  Detailed  algorithm  statements 
for  classifier  operation  are  provided  in  the  companion  report  [  ].  Here  the  operation  is  described 
in  a  less  formal  way. 

4.1  BTC  and  BSC 

The  operation  of  these  classifiers  is  particularly  simple.  First,  the  item  to  be  classified  is  obtained. 
This  is  a  single  CID  for  the  problem  of  interest  for  the  BTC  and  all  available  CIDs  for  the  BSC. 
Starting  at  the  root  node  of  the  BTC  (or  BSC),  the  wavelet  signal  processing  operations  encoded 
in  the  measurement  vector  w  are  used  to  compute  the  feature  vector  f,  which  has  length  K.  The 
correlation coefficient between f and the left-going average feature vector v is computed and compared to that for the right-going average feature vector u. Take the left path if the feature is more highly
correlated  with  v,  otherwise  take  the  right  path.  Continue  in  this  manner  until  a  terminal  node  is 
reached.  The  class  label  associated  with  the  reached  terminal  node  is  the  decision. 

Note  that  the  use  of  the  correlation  coefficient  means  that  the  amplitude  of  the  data  element  to 
be  classified  is  irrelevant,  since  it  is  normalized  away. 
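
A minimal Python sketch of this traversal is given below. It assumes the Pearson form of the correlation coefficient (mean removed and normalised), which may differ in detail from the normalisation used in the report's software; the `DecisionNode` record, the `measure` callable standing in for the wavelet measurement vector w, and the small three-class tree are all illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Union


def corrcoef(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den > 0 else 0.0


@dataclass
class DecisionNode:
    measure: Callable[[Sequence[float]], Sequence[float]]  # stands in for w
    v: Sequence[float]  # average feature vector of the left superclass
    u: Sequence[float]  # average feature vector of the right superclass
    left: Union["DecisionNode", int]   # child node or terminal class label
    right: Union["DecisionNode", int]


def classify(root: DecisionNode, data: Sequence[float]) -> int:
    """Walk from the root to a terminal node and return its class label."""
    node: Union[DecisionNode, int] = root
    while isinstance(node, DecisionNode):
        f = node.measure(data)
        go_left = corrcoef(f, node.v) > corrcoef(f, node.u)
        node = node.left if go_left else node.right
    return node


def identity(x: Sequence[float]) -> Sequence[float]:
    return list(x)


lower = DecisionNode(identity, v=[4, 3, 2, 1], u=[4, 1, 3, 2], left=2, right=3)
root = DecisionNode(identity, v=[1, 2, 3, 4], u=[4, 2, 2.5, 1.5], left=1, right=lower)
print(classify(root, [8, 6, 4, 2]), classify(root, [80, 60, 40, 20]))
```

The two calls differ only by a factor of ten in amplitude and return the same label, which illustrates the amplitude invariance noted above.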

4.2 BHC

The  BHC  operation  is  only  slightly  more  complex  than  operation  for  the  BTC.  First,  the  item  to  be 
classified is obtained. Start with whatever CID is cheapest to obtain, or start with the CID corresponding to the BTC with the minimum-ambiguity root node. Compute the required two correlation
coefficients  and  select  the  left  or  right  path  out  of  this  first  node.  Check  the  hypertree  pointer  in  the 
new  node.  If  it  points  to  the  current  tree  and  current  node,  then  continue,  else  jump  to  the  tree  and 
node  stored  in  the  pointer.  The  BTC  that  is  jumped  to  may  correspond  to  the  same  CID  or  a  new 
CID.  If  it  is  a  new  CID,  perform  the  operations  needed  to  obtain  the  data  and  compute  the  required 
correlation  coefficients.  Continue  in  this  way  until  a  terminal  node  is  reached. 
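
The jump logic and its effect on the amount of data requested can be sketched as follows. This is a simplified illustration: each node's left/right test is abstracted into a `goes_left` callable instead of the two correlation coefficients, the pointer is followed at every node including the first, and the node keys, CID names, and `acquire` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Set, Tuple, Union


@dataclass
class HyperNode:
    cid: str                                      # CID type consumed by this node's BTC
    goes_left: Callable[[Sequence[float]], bool]  # stands in for the correlation test
    left: Union[str, int]                         # key of child node, or class label
    right: Union[str, int]
    pointer: str                                  # key of the node to actually use


def classify_bhc(
    nodes: Dict[str, HyperNode],
    start: str,
    acquire: Callable[[str], Sequence[float]],
) -> Tuple[int, Set[str]]:
    """Return the class label and the set of CID types that had to be acquired."""
    cache: Dict[str, Sequence[float]] = {}
    key: Union[str, int] = start
    while not isinstance(key, int):
        node = nodes[nodes[key].pointer]  # jump if the pointer names another node
        if node.cid not in cache:         # request a CID only when first needed
            cache[node.cid] = acquire(node.cid)
        key = node.left if node.goes_left(cache[node.cid]) else node.right
    return key, set(cache)


nodes = {
    "A0": HyperNode("gray", lambda d: d[0] > 0, left="A1", right=3, pointer="A0"),
    "A1": HyperNode("gray", lambda d: d[1] > 0, left=1, right=2, pointer="B1"),
    "B1": HyperNode("color", lambda d: d[0] > 0, left=1, right=2, pointer="B1"),
}
sensors = {"gray": [1.0, 1.0], "color": [-1.0]}
label, used = classify_bhc(nodes, "A0", sensors.__getitem__)
print(label, sorted(used))
```

Inputs that are resolved at the first node never trigger acquisition of the second CID, which is the source of the potential data savings relative to a classifier that always consumes every CID.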

4.3  Path  Correction  in  the  BTC 

It  will  be  very  rare  to  construct  a  tree  classifier  such  that  all  decision  nodes  have  very  small  or 
zero  ambiguity.  This  will  happen  only  for  problems  that  are  inherently  easy  to  solve.  Therefore, 
BTCs  for  real-world  difficult  problems  will  have  one  or  more  nodes  with  relatively  high  ambiguity 
and,  therefore,  non-negligible  probability  of  decision  error.  Because  the  computational  operations 
required  during  tree-traversal  are  modest  in  complexity,  we  consider  multiple  traversals  of  the 
tree  in  an  attempt  to  provide  the  best  possible  decision.  In  particular,  we  outfit  the  classifier  with 
the capability of detecting a poor-quality decision, and a means to avoid that decision during a
subsequent  traversal.  We  call  this  notion  path  correction. 

The key idea in path correction is that of decision quality. When the node ambiguities encountered during a particular tree traversal are all small, we would expect that the winning correlation
coefficient  at  each  node  in  the  traversed  path  will  be  close  to  one.  On  the  other  hand,  if  one  of  the 
encountered  nodes  has  high  ambiguity  and  an  erroneous  decision  is  made  at  this  node,  we  would 
expect  that  all  winning  correlation  coefficients  for  the  remaining  nodes  in  the  path  will  be  far  from 
one  since  the  path  cannot  contain  the  true  class.  Therefore,  it  should  be  possible  to  detect,  based 
on  the  sequence  of  winning  correlation  coefficients  obtained  during  traversal,  a  “bad  path.”  Then 
the  tree  can  be  retraversed  with  the  detected  low-quality  decision  ruled  out. 

In our path-correction algorithm, the decision quality is simply the value of the winning correlation coefficient at the decision node just above the chosen terminal node. The tree is traversed as many times as needed, until either the decision quality is high or the root node is reached. In the latter case, the path-correction algorithm simply produces the original decision.

We will see that path correction is particularly well suited to situations involving weak ambiguities. For problems containing strong ambiguities, path correction should not help since multiple
paths  through  the  tree  will  result  in  identical  decision  qualities  (cf.  Figures  5  and  6). 


5  Experiments 

In this section, we report on various experiments aimed at validating our classification theory and design. In Section 5.1, we examine a one-dimensional problem involving 16 classes and three CID types. The classes correspond to the 16 unique maximal-length shift-register (MLSR) sequences for shift-register length eight [ ]. In Section 5.2, we examine a synthetic two-dimensional problem involving eight classes and four CID types. The classes correspond to eight physical objects seen through four camera types and involve serious ambiguities for each CID. In Section 5.3 we look at several classification problems using publicly available data. The involved classes include DNA sequences, letters of the Roman alphabet, and data related to the space shuttle.

The  goals  of  the  experiments  are  as  follows: 

1.  Validate  the  operation  of  the  basic  binary  tree  classifier. 

2.  Validate  the  performance  ordering  BSC  >  BHC  >  {BTC}. 

3.  Determine  the  data-burden  advantage  of  the  hypertree  classifier  over  the  supertree  (fusion) 
classifier. 

4.  Determine  the  influence  of  the  particular  choice  of  wavelet. 

5.  Determine  the  efficacy  of  path  correction  for  the  BTC  and  BSC. 

6.  Determine  how  well  the  system  automatically  determines  the  problem  structure;  that  is,  how 
well  the  system  identifies  and  isolates  ambiguous  classes  in  the  produced  trees. 

5.1  Toy  Problem  One:  One-Dimensional  Inputs 

In  this  section,  we  report  on  a  number  of  related  experiments  in  which  the  classes  of  interest  consist 
of  a  set  of  sixteen  binary  sequences.  The  sequences  have  length  256  and  are  the  sixteen  unique 
maximum-length  shift-register  (MLSR)  sequences  associated  with  a  shift-register  length  of  eight 
[8].  The  goal  of  the  classification  system  is  to  efficiently  use  the  three  classifier  input  data  (CID) 
types  to  correctly  determine  which  MLSR  sequence  gave  rise  to  the  data. 
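
For readers unfamiliar with MLSR sequences, the Python sketch below generates one of them and confirms the class count. It relies on two standard facts: a length-eight register driven by a primitive feedback polynomial produces a binary sequence of period 2^8 - 1 = 255, and the number of primitive polynomials of degree eight is φ(255)/8 = 16. The particular polynomial, the initial fill, and the function names are choices made here for illustration; the report does not state which fills were used or how the sequences were brought to the quoted length of 256.

```python
from math import gcd
from typing import List


def euler_phi(n: int) -> int:
    """Euler's totient: count of integers in 1..n that are coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def lfsr_sequence(taps: List[int], length: int, degree: int = 8) -> List[int]:
    """Binary sequence from the recurrence s[n+degree] = XOR of s[n+t] for t in taps."""
    seq = [1] + [0] * (degree - 1)  # any nonzero initial fill works
    while len(seq) < length:
        n = len(seq) - degree
        bit = 0
        for t in taps:
            bit ^= seq[n + t]
        seq.append(bit)
    return seq[:length]


# Recurrence for the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1.
TAPS = [0, 2, 3, 4]
PERIOD = 2 ** 8 - 1

seq = lfsr_sequence(TAPS, 2 * PERIOD)
num_sequences = euler_phi(PERIOD) // 8
print(num_sequences, seq[:PERIOD] == seq[PERIOD:])
```

The sixteen classes of this experiment correspond to the sixteen distinct feedback polynomials of this kind; the sketch generates only one of them.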

Classifier  Input  Data  Types. 

There are three CIDs for the one-dimensional problem. These correspond to the sequences themselves, filtered sequences, and an auxiliary input called metal. Graphs of the sixteen noise-free elements of each CID set are shown in Figures 7-9. The concept here is that one CID is a high-resolution version of the sequences (which will later suffer from low SNR), one is a low-resolution version (which may be less costly to obtain), and the final CID is an auxiliary piece of information that has low discrimination capability, but may be useful for resolving ambiguities arising from the first two CIDs.

Experiment  Outputs. 

For  each  distinct  subexperiment,  the  following  results  will  be  provided. 

1.  The  obtained  classification  trees  will  be  displayed.  The  decision  nodes  will  be  annotated 
with  their  ambiguity  and  the  terminal  nodes  with  their  class  label. 

2. The overall probability of correct classification (Pcc) will be graphed versus classifier index. This graph reveals the overall relations between the BSC, BHC, and the constituent BTCs. We would expect that the BHC and BSC have the best performance, and could be substantially better than any of the BTCs.

3. The confusion matrix will be displayed so that the basic misclassification patterns can be observed and correlated with any obvious class ambiguities.

4. Two histograms will be provided for the quality measure. The first corresponds to all trials for which the classification is correct, and the second corresponds to all other trials. This will allow a quantitative assessment of the quality measure as a tool for detecting incorrect classifier outputs.

5. For the BHC, the average number of required distinct CID types will be reported. This number will be compared to the data input requirement for the appropriate BSC to determine the BHC’s average potential savings in required input data.

Figure 7: The sixteen MLSR sequences for the one-dimensional experiments.
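
The probability of correct classification, the confusion matrix, and the two quality populations listed above can all be computed from per-trial records, as in the following Python sketch. The function names and the four-trial data set are invented for illustration; class labels are assumed to run from 1 to C.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(
    truth: Sequence[int], decided: Sequence[int], num_classes: int
) -> List[List[int]]:
    """counts[i][j] = number of trials with true class i+1 and decision j+1."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, d in zip(truth, decided):
        counts[t - 1][d - 1] += 1
    return counts


def prob_correct(counts: List[List[int]]) -> float:
    """Overall Pcc: fraction of trials on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in counts)
    correct = sum(counts[i][i] for i in range(len(counts)))
    return correct / total


def split_quality(
    quality: Sequence[float], truth: Sequence[int], decided: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Separate quality values into correct-decision and incorrect-decision groups."""
    right = [q for q, t, d in zip(quality, truth, decided) if t == d]
    wrong = [q for q, t, d in zip(quality, truth, decided) if t != d]
    return right, wrong


truth = [1, 1, 2, 3]
decided = [1, 2, 2, 3]
quality = [0.99, 0.50, 0.97, 0.98]

counts = confusion_matrix(truth, decided, num_classes=3)
print(counts, prob_correct(counts))
print(split_quality(quality, truth, decided))
```

Histogramming the two lists returned by `split_quality` separately gives the pair of quality histograms described in item 4.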

5.1.1  Experiment  1.1:  Basic  1-D  Processing 

In this first MLSR experiment, BTC, BHC, and BSC structures are obtained for a single wavelet type. Since there are three CIDs and one measurement type, there are three basic BTCs, one BHC, and one BSC, for a total of five distinct classifiers. Six additional derived classifiers, explained below, are added to obtain eleven classifiers. The experimental parameters are provided in Table 1.

Figure 9: The auxiliary information CID for the one-dimensional experiments.


Parameter                        Value
Wavelet Type                     Coiflet
Wavelet Parameter                5
Feature Length K                 20
Number of Classes C              16
BTC/BHC Wavelet Tree Depth J     6
BSC Wavelet Tree Depth J         8
Number of CIDs                   3
Data Dimension                   [1 256]
Processed Data Dimension         [1 512]
Training SNR                     ∞
Input SNR, CIDs 1, 2, 3          10, 13, 10 dB
Random Translation               None
Random Scaling                   None
Tree Topology                    Free
Superclass Assignment            Free
Number of Trials                 100

Table  1:  Experimental  parameters  for  the  first  1-D  experiment. 


Probability  of  Correct  Classification  The  overall  classification  performance  for  the  three  basic 
BTCs,  the  BHC,  and  the  BSC  is  summarized  in  Figure  10. 

Automatically  Obtained  BTCs  and  BSC  The  obtained  BTC  structures  are  plotted  in  Figures 
11-15.  The  first  three  of  these  correspond  to  the  BTCs  that  are  found  for  the  three  CIDs,  and  are 
called  the  basic  BTCs.  The  remaining  six  BTCs  are  formed  by  fixing  the  structure  of  one  of  the 
basic  BTCs  and  replacing  the  CID  with  one  of  the  other  two  CIDs,  and  are  called  derived  BTCs. 
The  obtained  BSC  tree  is  shown  in  Figure  15.  In  these  plotted  trees,  and  in  all  others  in  this  report, 
nodes  that  are  “jump-to”  nodes  in  the  BHC  are  colored  green,  and  nodes  that  have  high  ambiguity 
are  colored  red.  Green  takes  precedence  over  red.  Otherwise,  the  node  is  black. 

Notice  the  prevalence  of  low-ambiguity  nodes  in  the  BTC  for  CID  1  (the  sequences  themselves) 
in Figure 11. This correlates well with the performance for this BTC shown in Figure 10. For the second CID, we see, also from Figure 11, that the tree is much more balanced, but that the ambiguities
are  generally  larger.  These  two  effects  tend  to  oppose  each  other,  and  performance  is  as  good  as  for 
CID  1.  For  CID  3,  the  most  ambiguous  of  the  three  CIDs,  we  see  from  Figure  12  that  the  obtained 
tree  is  highly  unbalanced  and  contains  some  large  ambiguities  high  in  the  tree.  This  correlates  well 
with  the  observed  probability  of  correct  classification  of  0.4  in  Figure  10. 

Regarding  the  BSC  in  Figure  15,  we  see  that  the  obtained  tree  is  distinct  from  any  of  the  three 
basic  BTCs  and  that  there  are  no  high-ambiguity  nodes  in  the  upper  parts  of  the  tree.  However,  the 
tree  is  severely  unbalanced  at  the  root  node,  and  this  can  cause  a  serious  performance  shortfall. 

Figure 10: Classification performance (Pcc) for Experiment 1.1.

Confusion  Matrices  The  confusion  matrices  for  the  three  basic  BTCs,  the  BHC,  and  the  BSC 
are  shown  in  Figures  16-18,  respectively.  Note  the  correlation  between  the  appearance  of  the 
confusion  matrix  and  the  structure  of  the  trees.  For  example,  for  CID  3,  we  have  the  tree  in  Figure 
12  and  the  confusion  matrix  in  Figure  17.  From  the  latter,  we  see  that  classes  5  and  6  are  commonly 
confused  for  a  large  number  of  input  classes.  From  the  tree,  we  see  that  these  two  classes  are  the 
only  possible  decisions  when  taking  the  left  path  out  of  the  root  node. 

Quality  Measures  The  histograms  of  the  quality  measures  for  Experiment  1.1  are  shown  in 
Figures  19-23. 

Conclusions  for  Experiment  1.1 

1.  The  BSC  is  outperformed  by  both  the  BTCs  and  the  BHC.  Since  the  BSC  has  access  to  all 
available  input  data,  it  should  not  be  outperformed  by  any  of  the  constituent  BTCs,  and  its 
performance  should  upper-bound  that  of  the  BHC.  The  reason  for  this  result  is  unclear. 

2.  The  BHC  outperforms  the  BTCs.  This  basically  means  that  the  algorithm  for  forming  the 
BHC  is  working  well,  choosing  the  best  nodes  in  the  correct  trees  so  that  switching  from  one 
BTC  to  the  next  results  in  a  performance  improvement  on  the  average. 

3.  The  algorithm  for  jointly  choosing  tree  topology  and  feature-vector  values  is  generating 
a  variety  of  topologies  and  tree-nodes  with  small  ambiguities.  We  also  observe  that  the 
algorithm  is  doing  a  good  job  in  pushing  large  ambiguities  downward  in  the  tree,  which  is 
a  necessary  condition  for  good  performance.  However,  since  the  structure  of  the  individual 
class  elements  is  hard  to  discern  (each  of  the  classes  is  represented  by  an  essentially  random 
binary  string),  it  is  difficult  to  determine  if  the  tree  topologies  make  sense  in  terms  of  good 
splits  of  the  incoming  class  labels  into  two  sets  of  outgoing  labels.  This  task  is  made  much 
easier  in  Experiment  2,  for  which  the  structure  of  the  data  is  more  easily  grasped  visually. 

4. The quality measure is a good indicator of the correctness of a classification decision. A
large  majority  of  incorrect  decisions  result  in  a  quality  measure  that  is  less  than  0.90,  while 
virtually  all  correct  decisions  result  in  a  quality  measure  that  is  greater  than  0.95.  This 
implies  that  the  BTCs  and  the  BSC  may  benefit  from  path  correction  (see  Sections  4  and 
5.1.3). 

Figure 11: BTCs obtained for Experiment 1 (1-2 of 9). Panels: BTC 1 (CID: High-Res) and BTC 2 (CID: Low-Res); Wavelet: Coiflet, Param: 5.

Figure 12: BTCs obtained for Experiment 1 (3-4 of 9). Panels: BTC 3 (CID: Metal) and BTC 4 (CID: Low-Res); Wavelet: Coiflet, Param: 5.

Figure 13: BTCs obtained for Experiment 1 (5-6 of 9). Panels: BTC 5 (CID: Metal) and BTC 6 (CID: High-Res); Wavelet: Coiflet, Param: 5.

Figure 14: BTCs obtained for Experiment 1 (7-8 of 9). Panels: BTC 7 (CID: Metal) and BTC 8 (CID: High-Res); Wavelet: Coiflet, Param: 5.

Figure 15: BTC 9 and the BSC obtained for Experiment 1. Panel: BTC 9 (CID: Low-Res); Wavelet: Coiflet, Param: 5.

Figure 16: Confusion matrices for BTCs 1 and 2 in Experiment 1.1.

Figure 17: Confusion matrices for BTC 3 and the BHC in Experiment 1.1.

Figure 18: Confusion matrix for the BSC in Experiment 1.1.

Figure 19: Quality histogram for BTC 1 in Experiment 1.1.

Figure 20: Quality histogram for BTC 2 in Experiment 1.1.

Figure 21: Quality histogram for BTC 3 in Experiment 1.1.

Figure 22: Quality histogram for the BHC in Experiment 1.1.

Figure 23: Quality histogram for the BSC in Experiment 1.1.

5.1.2 Experiment 1.2: Multiple Wavelet Types

In  this  second  one-dimensional  experiment,  the  number  of  constituent  BTCs  available  for  use  in 
the  BHC  is  increased  by  allowing  distinct  measurement  functions  (wavelet  types)  in  addition  to 
the  distinct  data  types  (CIDs).  In  particular,  ten  distinct  wavelet  types  are  used  instead  of  just  one 
as  in  Experiment  1.1.  The  parameters  for  Experiment  1.2  are  shown  in  Table  2. 

For each wavelet type, the three basic BTCs are found through the algorithm that jointly determines tree topology and feature-vector values. This yields a total of thirty BTCs. For each of
these,  two  additional  BTCs  are  created  by  fixing  the  topology  and  superclass  choices,  selecting 
one  of  the  other  CID  types,  and  retraining  the  structure.  This  yields  60  more  BTCs.  The  BHC  is 
based  on  these  90  classifiers.  Finally,  ten  BSCs  are  created,  one  for  each  wavelet  type. 
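
The classifier counts in this paragraph can be checked by direct enumeration, as in the short Python sketch below. The wavelet list follows Table 2 and the CID names follow the figure panel titles; the tuple layout and variable names are illustrative assumptions only.

```python
from itertools import product

# (wavelet family, parameter); None marks families listed without a parameter.
WAVELETS = [
    ("Beylkin", None), ("Coiflet", 1), ("Coiflet", 5),
    ("Daubechies", 4), ("Daubechies", 20),
    ("Symmlet", 4), ("Symmlet", 10),
    ("Vaidyanathan", None), ("Battle", 1), ("Battle", 3),
]
CIDS = ["High-Res", "Low-Res", "Metal"]

# Basic BTCs: one per (wavelet, CID) pair, topology found by the joint algorithm.
basic = list(product(WAVELETS, CIDS))

# Derived BTCs: keep a basic BTC's topology and superclasses, retrain on another CID.
derived = [
    (wavelet, trained_on, swapped)
    for wavelet, trained_on in basic
    for swapped in CIDS
    if swapped != trained_on
]

bhc_pool = len(basic) + len(derived)   # constituents available to the BHC
num_bscs = len(WAVELETS)               # one BSC per wavelet type
print(len(basic), len(derived), bhc_pool, num_bscs)
```

The same enumeration with a single wavelet gives the three basic and six derived BTCs of Experiment 1.1.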


Parameter                        Value
Wavelets (type, parameter)       Beylkin; Coiflet 1; Coiflet 5; Daubechies 4; Daubechies 20;
                                 Symmlet 4; Symmlet 10; Vaidyanathan; Battle 1; Battle 3
Feature Length K                 20
Number of Classes C              16
BTC/BHC Wavelet Tree Depth J     6
BSC Wavelet Tree Depth J         8
Number of CIDs                   3
Data Dimension                   [1 256]
Processed Data Dimension         [1 512]
Training SNR                     ∞
Input SNR, CIDs 1, 2, 3          10, 13, 10 dB
Random Translation               None
Random Scaling                   None
Tree Topology                    Free
Superclass Assignment            Free
Number of Trials                 100

Table  2:  Experimental  parameters  for  the  second  1-D  experiment. 


Automatically  Obtained  BTCs  and  BSCs  The  thirty  basic  BTCs,  the  BHC,  and  the  ten  BSCs 
for Experiment 1.2 tend to be either largely balanced or severely unbalanced. For reasons of brevity, we do not display the trees in this section. The results are fairly similar to those for Experiment 1.1.

Probability of Correct Classification The probabilities of correct classification for the thirty basic BTCs and the ten BSCs in Experiment 1.2 are shown in Figure 24. The BHC is outperformed by several BTCs, including those used in Exp 1.1 (BTCs 7-9). The algorithm for constructing a BHC from constituent BTCs must therefore be flawed; otherwise it could restrict its attention to BTCs 7-9 alone and thereby obtain the performance of the BHC in Experiment 1.1.

Note  also  that  the  BTC  and  BSC  performance  is  strongly  dependent  on  the  wavelet.  The  best 
wavelet  for  an  individual  BTC  is  Daubechies  with  CID  2  (BTC  11),  and  the  best  wavelet  for  the 
BSC  is  Beylkin  (BSC  1). 

Figure 24: Classification performance for Experiment 1.2.


Confusion  Matrices  The  obtained  confusion  matrices  for  the  BTCs  are  shown  in  Figures  25-39, 
for  the  hypertree  classifier  in  Figure  40,  and  for  the  BSCs  in  Figures  41-45. 

Quality  Measures  Finally,  the  quality  measures  for  the  various  classifiers  in  Experiment  1.2  are 
shown  in  Figures  46-66. 

Conclusions  for  Experiment  1.2 

1.  Many  of  the  BTCs  for  this  experiment  result  in  good-to-excellent  performance,  but  the  exact 
performance  level  is  dependent  on  the  specific  wavelet. 

2. The BHC is not correctly constructed from the set of constituent BTCs. Our suspicion is
that the algorithm is relying too heavily on the use of ambiguity to choose the “jump-to”
nodes in the hypertree. This must be balanced by the power of the obtained average feature
vectors (relative to the average power of the input classes). Very weak average features may
have ambiguities near zero, but are also highly susceptible to noise, so that the hypertree
algorithm must take both metrics into account when choosing nodes. The current version of
the algorithm does, in fact, take both of these parameters into account, but the method must
be refined.

3. The average quality measure for incorrect decisions is very small compared to the average
quality measure for correct decisions. This implies that path correction may provide a
substantial performance boost for all BTCs and BSCs.

Figure 25: Confusion matrices for BTCs 1 and 2 in Experiment 1.2.

Figure 26: Confusion matrices for BTCs 3 and 4 in Experiment 1.2. Panels: BTC 3 (Beylkin, Metal) and BTC 4 (Coiflet, High-Res); axis label: Output Decision.

Figure 27: Confusion matrices for BTCs 5 and 6 in Experiment 1.2. Panels: BTC 5 (Coiflet, Low-Res) and BTC 6 (Coiflet, Metal).


Figure 28: Confusion matrices for BTCs 7 and 8 in Experiment 1.2. Panels: BTC 7 (Coiflet, High-Res) and BTC 8 (Coiflet, Low-Res); axis label: Output Decision.

Figure 29: Confusion matrices for BTCs 9 and 10 in Experiment 1.2. Panels: BTC 9 (Coiflet, Metal) and BTC 10 (Daubechies, High-Res).

Figure 30: Confusion matrices for BTCs 11 and 12 in Experiment 1.2. Panels: BTC 11 (Daubechies, Low-Res) and BTC 12 (Daubechies, Metal).

Figure 31: Confusion matrices for BTCs 13 and 14 in Experiment 1.2. Panels: BTC 13 (Daubechies, High-Res) and BTC 14 (Daubechies, Low-Res).

Figure 32: Confusion matrices for BTCs 15 and 16 in Experiment 1.2. Panels: BTC 15 (Daubechies, Metal) and BTC 16 (Symmlet, High-Res).

Figure 33: Confusion matrices for BTCs 17 and 18 in Experiment 1.2. Panels: BTC 17 (Symmlet, Low-Res) and BTC 18 (Symmlet, Metal).

Figure 34: Confusion matrices for BTCs 19 and 20 in Experiment 1.2. Panels: BTC 19 (Symmlet, High-Res) and BTC 20 (Symmlet, Low-Res).

Figure 35: Confusion matrices for BTCs 21 and 22 in Experiment 1.2.

Figure 36: Confusion matrices for BTCs 23 and 24 in Experiment 1.2. Panels: BTC 23 (Vaidyanathan, Low-Res) and BTC 24 (Vaidyanathan, Metal).

Figure 37: Confusion matrices for BTCs 25 and 26 in Experiment 1.2.

Figure 38: Confusion matrices for BTCs 27 and 28 in Experiment 1.2. Panels: BTC 27 (Battle, Metal) and BTC 28 (Battle, High-Res).

Figure 39: Confusion matrices for BTCs 29 and 30 in Experiment 1.2. Panels: BTC 29 (Battle, Low-Res) and BTC 30 (Battle, Metal).

Figure 40: Confusion matrix for the BHC in Experiment 1.2.

Figure 41: Confusion matrices for BSCs 1 and 2 in Experiment 1.2. Axis label: Output Decision.

Figure 42: Confusion matrices for BSCs 3 and 4 in Experiment 1.2.

Figure 43: Confusion matrices for BSCs 5 and 6 in Experiment 1.2. Panels: BSC 5 (Daubechies) and BSC 6 (Symmlet).

Figure 44: Confusion matrices for BSCs 7 and 8 in Experiment 1.2.

Figure 45: Confusion matrices for BSCs 9 and 10 in Experiment 1.2.

Figure 46: Quality histograms for BTCs 1 and 2 in Experiment 1.2.


Figure 47: Quality histograms for BTCs 3 and 4 in Experiment 1.2. Panels: BTC 3 (Beylkin, Metal) and BTC 4 (Coiflet, High-Res).

Figure 48: Quality histograms for BTCs 5 and 6 in Experiment 1.2. Panels: BTC 5 (Coiflet, Low-Res) and BTC 6 (Coiflet, Metal).

Figure 49: Quality histograms for BTCs 7 and 8 in Experiment 1.2. Panel title recovered: BTC 7 (Coiflet, High-Res).

Figure 50: Quality histograms for BTCs 9 and 10 in Experiment 1.2. Panels: BTC 9 (Coiflet, Metal) and BTC 10 (Daubechies, High-Res).

Figure 51: Quality histograms for BTCs 11 and 12 in Experiment 1.2.

Figure 52: Quality histograms for BTCs 13 and 14 in Experiment 1.2. Panels: BTC 13 (Daubechies, High-Res) and BTC 14 (Daubechies, Low-Res).

Figure 53: Quality histograms for BTCs 15 and 16 in Experiment 1.2. Panels: BTC 15 (Daubechies, Metal) and BTC 16 (Symmlet, High-Res).

Figure 54: Quality histograms for BTCs 17 and 18 in Experiment 1.2. Panels: BTC 17 (Symmlet, Low-Res) and BTC 18 (Symmlet, Metal).

Figure 55: Quality histograms for BTCs 19 and 20 in Experiment 1.2. Panels: BTC 19 (Symmlet, High-Res) and BTC 20 (Symmlet, Low-Res).

Figure 56: Quality histograms for BTCs 21 and 22 in Experiment 1.2. Panels: BTC 21 (Symmlet, Metal) and BTC 22 (Vaidyanathan, High-Res).

Figure 57: Quality histograms for BTCs 23 and 24 in Experiment 1.2. Panels: BTC 23 (Vaidyanathan, Low-Res) and BTC 24 (Vaidyanathan, Metal).

Figure 58: Quality histograms for BTCs 25 and 26 in Experiment 1.2. Panel title recovered: BTC 25 (Battle, High-Res).

Figure 59: Quality histograms for BTCs 27 and 28 in Experiment 1.2. Panel title recovered: BTC 27 (Battle, Metal).

Figure 60: Quality histograms for BTCs 29 and 30 in Experiment 1.2. Panels: BTC 29 (Battle, Low-Res) and BTC 30 (Battle, Metal).

Figure 61: Quality histograms for the BHC in Experiment 1.2.


Figure 62: Quality histograms for BSCs 1 and 2 in Experiment 1.2. Panels: BSC 1 (Beylkin) and BSC 2 (Coiflet).


Figure 63: Quality histograms for BSCs 3 and 4 in Experiment 1.2. Panels: BSC 3 (Coiflet) and BSC 4 (Daubechies).

Figure 64: Quality histograms for BSCs 5 and 6 in Experiment 1.2.

Figure 65: Quality histograms for BSCs 7 and 8 in Experiment 1.2. Panels: BSC 7 (Symmlet) and BSC 8 (Vaidyanathan).

Figure 66: Quality histograms for BSCs 9 and 10 in Experiment 1.2.

5.1.3 Experiment 1.3: Path Correction in the BTC

In this experiment, we revisit Experiment 1.1, but we use path correction in the BTCs and the BSC.
Path correction has not been implemented for the BHC. Recall that path correction allows a BTC to
follow several different paths through the tree until one is found that has sufficiently high decision
quality. The paths are selected by detecting the presence of a decision node with poor quality,
retraversing the tree, and forcing this node to make the alternate choice. That is, we do not
simply compute all paths through the tree in an exhaustive way and pick the best. As a result, path
correction usually involves a small number of iterations.
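The retraversal loop described above can be sketched as follows. This is a minimal illustration in Python, not the implementation used in the experiments. The names `Leaf`, `Node`, `traverse`, `path_quality`, and `classify_with_path_correction` are hypothetical, as are the use of the magnitude of a node's score as its decision quality, the rule that a path's quality is the lowest quality among its unforced nodes, the threshold `min_quality`, and the fallback to the uncorrected decision when no acceptable path is found.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple, Union


@dataclass
class Leaf:
    label: int  # class label reported at the end of a path


@dataclass
class Node:
    name: str
    # score(x) > 0 selects the right child; abs(score) serves as the node's
    # decision quality (an assumption made for this sketch).
    score: Callable[[Sequence[float]], float]
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


Step = Tuple[str, bool, float]  # (node name, went right?, node quality)


def traverse(tree, x, forced: Dict[str, bool]) -> Tuple[int, List[Step]]:
    """Walk from the root to a leaf, overriding any node listed in `forced`."""
    path: List[Step] = []
    node = tree
    while isinstance(node, Node):
        s = node.score(x)
        go_right = forced.get(node.name, s > 0)
        path.append((node.name, go_right, abs(s)))
        node = node.right if go_right else node.left
    return node.label, path


def path_quality(path, forced) -> float:
    """Lowest quality among nodes that made a free (unforced) choice."""
    free = [q for name, _, q in path if name not in forced]
    return min(free) if free else 0.0


def classify_with_path_correction(tree, x, min_quality=0.5, max_retries=4):
    """Return (label, number of retraversals used)."""
    forced: Dict[str, bool] = {}
    first_label, path = traverse(tree, x, forced)
    label = first_label
    retries = 0
    while path_quality(path, forced) < min_quality:
        free = [(q, name, went) for name, went, q in path if name not in forced]
        if not free or retries == max_retries:
            return first_label, retries  # no acceptable path: keep first decision
        _, name, went = min(free)        # poorest free node on the current path
        forced[name] = not went          # force the alternate choice
        label, path = traverse(tree, x, forced)
        retries += 1
    return label, retries


# Hypothetical four-class tree whose node scores are single input coordinates.
tree = Node("root", lambda x: x[0],
            left=Node("L", lambda x: x[1], Leaf(0), Leaf(1)),
            right=Node("R", lambda x: x[2], Leaf(2), Leaf(3)))

print(classify_with_path_correction(tree, [0.1, -2.0, 0.05]))  # (0, 2)
```

In the usage line, the first pass ends at class 3 through two poor-quality nodes. Two retraversals later the path through node `L` has quality 2.0 and is accepted, so only three of the four root-to-leaf paths are ever visited.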

Probability of Correct Classification The probability of correct classification for Experiment 1.3
is shown in Figure 67 (compare to Figure 10). Note that the performance of every BTC and of
the BSC is improved over the corresponding performance in Experiment 1.1. The most dramatic
increase is for CID 3, which has a probability of correct classification of 0.4 without path correction
and greater than 0.9 with path correction.

Figure 67: Classification performance for Experiment 1.3


Confusion  Matrices  The  confusion  matrices  for  Experiment  1.3  are  shown  in  Figures  68-70. 
Note  that  for  BTCs  1  and  3  and  for  the  BSC  there  are  only  a  few  ambiguous  classes.  Performance 
for  BTC  1,  which  directly  operates  on  the  MLSR  sequence,  is  near  perfect,  indicating  that  the 
automatically  obtained  tree  structure  together  with  path  correction  can  find  the  small  number  of 
wavelet  basis  vectors  needed  for  excellent  performance  with  no  prior  knowledge  provided  to  the 
algorithm. 

Quality Measures The decision-quality histograms for Experiment 1.3 are shown in Figures 71-75.
There are far fewer low-quality incorrect decisions than in Experiment 1.1 (no path
correction). However, the BHC, which does not yet have path correction available, still produces a
large number of low-quality incorrect decisions.

Figure 68: Confusion matrices for BTCs 1 and 2 in Experiment 1.3.


Conclusions  for  Experiment  1.3 

1. Path correction appears to substantially improve performance for a BTC (or BSC, being a
special case of the BTC). For CID 3, performance improved from Pcc = 0.4 to greater than 0.9.

2.  The  presence  of  a  large  number  of  very  low-quality  decisions  for  the  BHC  indicates  that  the 
BHC  may  also  be  greatly  improved  by  employing  some  form  of  path  correction.  This  has 
not  yet  been  done  since  the  path  through  the  hypertree  is  much  more  complex  than  the  path 
through  a  single  BTC. 

3.  Our  suspicion  that  path  correction  is  particularly  well  suited  to  problems  having  soft  class 
ambiguities  is  supported  by  the  results  of  Experiment  1.3.  In  Experiment  2  we  will  look  for 
further  evidence,  since  the  classes  in  that  problem  have  multiple  hard  ambiguities. 

Figure 69: Confusion matrices for BTC 3 and the BHC in Experiment 1.3.

Figure 70: Confusion matrix for the BSC in Experiment 1.3.

Figure 71: Quality histogram for BTC 1 in Experiment 1.3.

Figure 72: Quality histogram for BTC 2 in Experiment 1.3.

Figure 73: Quality histogram for BTC 3 in Experiment 1.3.

Figure 74: Quality histogram for the BHC in Experiment 1.3.

Figure 75: Quality histogram for the BSC in Experiment 1.3.

5.2 Toy Problem Two: Two-Dimensional Inputs

In  this  section,  we  report  on  a  number  of  related  classification  experiments  for  which  the  synthetic 
input  classes  consist  of  eight  objects.  The  objects  are  represented  as  images  (two-dimensional 
inputs),  and  there  are  four  classifier  input  data  (CID)  types.  The  physical  interpretation  is  that  each 
object  can  be  viewed  through  one  of  four  distinct  camera  types.  The  images  have  dimension  32  by 
32  for  each  CID.  The  goal  of  the  processing  is  to  correctly  classify  the  object  type  (decide  on  the 
class  label)  given  a  minimum  number  of  CIDs. 

Classifier  Input  Data  Types. 

The  eight  classes  and  four  CIDs  result  in  the  thirty-two  images  shown  in  Figure  76.  The  eight 
underlying  abstract  objects  are  seen  through  four  cameras:  black-and-white,  gray-scale,  color,  and 
infrared.  Notice  the  substantial  ambiguity  built  into  each  of  the  CIDs.  For  example,  for  CID  1 
(black-and-white  camera),  classes  1,  2,  5,  and  6  are  identical,  and  classes  7  and  8  are  identical. 
So  the  best  possible  classifier  for  this  type  of  CID  will  produce  many  errors.  The  idea  behind  the 
present  research  is  the  development  of  a  classifier  that  can  take  advantage  of  the  additional  CIDs 
in  such  a  way  as  to  resolve  all  ambiguities  while  simultaneously  using  the  minimum  number  of 
additional  CIDs. 

Experimental  Outputs. 

The outputs of the experiments are identical to those in Experiment 1: the obtained classification
trees, the probability of correct classification for each classifier, the confusion matrix for each
classifier, quality-measure histograms, and the number of distinct CIDs required by the BHC.

5.2.1  Experiment  2.1:  Basic  2-D  Processing 

In the first two-dimensional experiment, the BTC, BHC, and BSC structures are obtained for a
single wavelet type and high SNR. Classification experiments are performed to illustrate the relative
effectiveness of the BTCs, the BHC, and the BSC. This establishes the basic correctness of the
approach and code, and sets up a baseline for comparison with more complex experimental results.
The parameters for Experiment 2.1 are listed in Table 3.

Automatically  Obtained  BTCs  and  BSC  The  trees  for  the  obtained  BTCs  and  the  BSC  are 
shown  in  Figures  77-85.  Note  that  the  first  four  BTCs  are  the  basic  BTCs,  and  the  remainder  are 
derived  (see  Section  5.1).  It  is  instructive  to  study  the  four  basic  BTCs  while  keeping  in  mind  the 
ambiguities  shown  in  Figure  76.  This  is  done  in  the  following  paragraphs. 

For CID 1 (black-and-white camera), we have the BTC in Figure 77. The important point
is that the ambiguities obvious from Figure 76 are reflected in the tree structure. In particular,
the ambiguous classes 7 and 8 are assigned their own branch in the tree, as are the four-way
ambiguous classes 1, 2, 5, and 6. Therefore, classes 7 and 8 will be confused with each other, as is
natural, and classes 1, 2, 5, and 6 will be confused with each other. If classes 7 and 8 are replaced
by a single class A and classes 1, 2, 5, and 6 are replaced by a single class B, then the resulting tree
would have the structure required by the formalism of this work: there would be a single path through
the tree to A and a single path to B. In this sense, the obtained tree should produce perfect results.

Parameter                         Value
Wavelet Type                      Coiflet
Wavelet Parameter                 5
Feature Length K                  20
Number of Classes C               8
BTC/BHC Wavelet Tree Depth J      4
BSC Wavelet Tree Depth J          5
Number of CIDs                    4
Data Dimension                    [32 32]
Training SNR                      ∞
Input SNR, CIDs 1, 2, 3, 4        10 dB
Random Translation                None
Random Scaling                    None
Tree Topology                     Free
Superclass Assignment             Free
Number of Trials                  100

Table 3: Experimental parameters for the first 2-D experiment.


For CID 2 (gray-scale camera), we have the BTC in Figure 78. Again, the inherent ambiguities
are reflected in the structure. Here classes 7 and 8 form one ambiguous set, and classes 1, 2,
and 6 form the other.

For CID 3 (color camera), the BTC is shown in Figure 79. Here the obtained structure does
not quite reflect the inherent ambiguity of the problem. Classes 7 and 8, which form an ambiguous
set, are found in separate regions of the tree. The other set of ambiguous classes, 2, 5, and 6, is
successfully isolated and grouped together in the tree.

Finally, for CID 4 (infrared camera), the BTC is given in Figure 80. All ambiguous sets are
correctly represented in the tree: {1, 2, 5}, {4, 8}, and {3, 6}.
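The ambiguous sets quoted for the four CIDs can be turned into a small worked calculation. The sketch below is illustrative only and rests on stated assumptions: classes within an ambiguous set are treated as exactly indistinguishable in that CID, every class not listed for a CID is treated as distinguishable, and all eight classes are equally likely. Under those assumptions a set of mutually identical classes contributes at most 1/C to the probability of correct classification, so a single CID is bounded by (number of blocks)/C. The function names are hypothetical.

```python
from fractions import Fraction

NUM_CLASSES = 8

# Ambiguous sets per CID, as listed in the text; every class not listed for a
# CID is assumed to be distinguishable from all others in that CID.
AMBIGUOUS_SETS = {
    1: [{1, 2, 5, 6}, {7, 8}],        # black-and-white camera
    2: [{1, 2, 6}, {7, 8}],           # gray-scale camera
    3: [{2, 5, 6}, {7, 8}],           # color camera
    4: [{1, 2, 5}, {4, 8}, {3, 6}],   # infrared camera
}


def partition(ambiguous_sets, num_classes=NUM_CLASSES):
    """Complete the ambiguous sets into a partition by adding singletons."""
    covered = set().union(*ambiguous_sets)
    singletons = [{c} for c in range(1, num_classes + 1) if c not in covered]
    return [set(s) for s in ambiguous_sets] + singletons


def pcc_upper_bound(blocks, num_classes=NUM_CLASSES):
    """Equal priors: each block of identical classes contributes at most 1/C."""
    return Fraction(len(blocks), num_classes)


def joint_partition(partitions, num_classes=NUM_CLASSES):
    """Two classes stay confusable only if they share a block in every CID."""
    groups = {}
    for c in range(1, num_classes + 1):
        signature = tuple(
            next(i for i, block in enumerate(p) if c in block) for p in partitions
        )
        groups.setdefault(signature, set()).add(c)
    return list(groups.values())


partitions = {cid: partition(sets) for cid, sets in AMBIGUOUS_SETS.items()}
bounds = {cid: pcc_upper_bound(p) for cid, p in partitions.items()}
joint = joint_partition(list(partitions.values()))

for cid, bound in bounds.items():
    print(f"CID {cid}: Pcc <= {bound}")
print("jointly distinguishable groups:", len(joint))
```

Under these idealized assumptions the single-CID bounds are 1/2, 5/8, 5/8, and 1/2, while the four CIDs taken together separate all eight classes. That is consistent with the stated goal of resolving the ambiguities by requesting additional CIDs.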

Figure 77: BTCs obtained for Experiment 2 (1-2 of 16). Panels: BTC 1 (CID: BlackWhite, Wavelet: Coiflet, Param: 5) and BTC 2 (CID: Grayscale, Wavelet: Coiflet, Param: 5).

Figure 78: BTCs obtained for Experiment 2 (3-4 of 16). Panel title recovered: BTC 3 (CID: Color, Wavelet: Coiflet, Param: 5).

Figure 79: BTCs obtained for Experiment 2 (5-6 of 16). Panels: BTC 5 (CID: Grayscale, Wavelet: Coiflet, Param: 5) and BTC 6 (CID: Color, Wavelet: Coiflet, Param: 5).

Figure 80: BTCs obtained for Experiment 2 (7-8 of 16). Panels: BTC 7 (CID: Infrared, Wavelet: Coiflet, Param: 5) and BTC 8 (CID: BlackWhite, Wavelet: Coiflet, Param: 5).

Figure 81: BTCs obtained for Experiment 2 (9-10 of 16). Panels: BTC 9 (CID: Color, Wavelet: Coiflet, Param: 5) and BTC 10 (CID: Infrared, Wavelet: Coiflet, Param: 5).

Figure 82: BTCs obtained for Experiment 2 (11-12 of 16). Panels: BTC 11 (CID: BlackWhite, Wavelet: Coiflet, Param: 5) and BTC 12 (CID: Grayscale, Wavelet: Coiflet, Param: 5).

Figure 83: BTCs obtained for Experiment 2 (13-14 of 16). Panel title recovered: BTC 13 (CID: Infrared, Wavelet: Coiflet, Param: 5).

Figure 84: BTCs obtained for Experiment 2 (15-16 of 16). Panel title recovered: BTC 15 (CID: Grayscale, Wavelet: Coiflet, Param: 5).

Figure 85: BSC obtained for Experiment 2.1.

Probability of Correct Classification The estimates of the probability of correct classification
(Pcc) for Experiment 2.1 are given in Figure 86. Note that the predicted performance ordering
holds: BSC > BHC > {BTC}.

Figure 86: Probability of correct classification for Experiment 2.1.


Confusion  Matrices  The  confusion  matrices  for  the  BTCs,  BHC,  and  BSC  for  Experiment  2.1 
are  shown  in  Figures  87-89.  It  is  very  easy  to  see,  for  each  CID,  the  correspondence  between  the 
ambiguous  classes  and  the  errors  in  the  confusion  matrix. 
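The correspondence between ambiguous classes and confusion-matrix errors can be illustrated with a small simulation. The sketch below is not the classifier used in the experiments: it assumes an idealized decision rule that guesses uniformly among the classes that look identical in CID 1, treats classes 3 and 4 as unambiguous, and runs 100 trials per class. Rows index the true class and columns index the output decision. The function names are hypothetical.

```python
import random

NUM_CLASSES = 8

# CID 1 ambiguity as stated in the text; classes 3 and 4 are assumed to be
# unambiguous in this CID.
BLOCKS = [{1, 2, 5, 6}, {7, 8}, {3}, {4}]


def idealized_decision(true_class, rng):
    """Guess uniformly within the block of classes that look identical."""
    block = next(b for b in BLOCKS if true_class in b)
    return rng.choice(sorted(block))


def confusion_matrix(trials_per_class=100, seed=0):
    """Rows: true class; columns: output decision (both 1-based in the text)."""
    rng = random.Random(seed)
    matrix = [[0] * NUM_CLASSES for _ in range(NUM_CLASSES)]
    for true_class in range(1, NUM_CLASSES + 1):
        for _ in range(trials_per_class):
            decided = idealized_decision(true_class, rng)
            matrix[true_class - 1][decided - 1] += 1
    return matrix


def pcc(matrix):
    """Estimated probability of correct classification: trace over total."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


matrix = confusion_matrix()
for row in matrix:
    print(" ".join(f"{count:3d}" for count in row))
print("estimated Pcc:", round(pcc(matrix), 3))
```

All off-diagonal counts fall inside the blocks {1, 2, 5, 6} and {7, 8}, which is the pattern the text describes, and the estimated Pcc lies near 0.5 for this idealized rule.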

Quality Measures The histograms for the quality measure for Experiment 2.1 are shown in Figures
90-92. What is most notable here is the large number of very-high-quality incorrect decisions
made in BTCs 1 and 3, and in the BHC. This should render path correction much less effective
than in the one-dimensional experiments.
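One way to make this expectation concrete is to measure the fraction of incorrect decisions whose quality falls below the level at which a correction would be attempted. A quality-triggered correction never revisits errors above that level. The sketch below assumes qualities normalized to the range 0 to 1 and a hypothetical threshold of 0.5. The two decision logs are invented for illustration and are not experimental data.

```python
def low_quality_error_fraction(decisions, quality_threshold):
    """Fraction of incorrect decisions whose quality is below the threshold.

    `decisions` is a sequence of (is_correct, quality) pairs.
    """
    error_qualities = [q for is_correct, q in decisions if not is_correct]
    if not error_qualities:
        return 0.0
    flagged = sum(1 for q in error_qualities if q < quality_threshold)
    return flagged / len(error_qualities)


# Hypothetical logs: errors with low quality (soft ambiguities) versus
# errors with high quality (hard ambiguities).
soft_ambiguity = [(True, 0.9), (True, 0.8), (False, 0.1), (False, 0.2), (True, 0.7)]
hard_ambiguity = [(True, 0.9), (False, 0.95), (False, 0.9), (False, 0.85), (False, 0.1)]

print(low_quality_error_fraction(soft_ambiguity, 0.5))  # 1.0
print(low_quality_error_fraction(hard_ambiguity, 0.5))  # 0.25
```

In the first log every error would trigger a retraversal. In the second only one error in four would, so most errors remain out of reach of path correction.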

Conclusions  for  Experiment  2.1 

1.  The  algorithms  for  creating  basic  binary  tree  classifiers  and  the  hypertree  classifier  can  work 
well  even  for  problems  involving  multiple  data  types  with  severe  ambiguities. 

2.  The  performance  ordering  is  as  predicted  by  the  theory,  with  the  BSC  outperforming  the 
BHC,  which  outperforms  all  its  constituent  BTCs  by  a  wide  margin. 

3. Excellent performance can be had through use of only a small number of key wavelet coefficients
even when each constituent BTC in a hypertree has very poor performance.

Figure 87: Confusion matrices for BTCs 1 and 2 in Experiment 2.1. Axis label: Output Decision (classes 1-8).

Figure 88: Confusion matrices for BTCs 3 and 4 in Experiment 2.1.

Figure 89: Confusion matrices for the BHC and BSC in Experiment 2.1.

Figure 90: Quality histograms for BTCs 1 and 2 in Experiment 2.1.

Figure 91: Quality histograms for BTCs 3 and 4 in Experiment 2.1.

Figure 92: Quality histograms for the BHC and BSC in Experiment 2.1.

5.2.2 Experiment 2.2: Multiple Wavelets

In this experiment, we consider the influence of the particular wavelet used in forming the LDB
(cf. Section 5.1.2). The parameters for the experiment are shown in Table 4. Six different wavelet
types are employed, and for four of these (Coiflet, Daubechies, Symmlet, and Battle), two variants
are used, for a total of ten distinct wavelets.


Parameter                         Value
Wavelets                          Beylkin
                                  Coiflet 1
                                  Coiflet 5
                                  Daubechies 4
                                  Daubechies 20
                                  Symmlet 4
                                  Symmlet 10
                                  Vaidyanathan
                                  Battle 1
                                  Battle 3
Feature Length K                  20
Number of Classes C               8
BTC/BHC Wavelet Tree Depth J      4
BSC Wavelet Tree Depth J          5
Number of CIDs                    4
Data Dimension                    [32 32]
Training SNR                      ∞
Input SNR, CIDs 1, 2, 3, 4        10 dB
Random Translation                None
Random Scaling                    None
Tree Topology                     Free
Superclass Assignment             Free
Number of Trials                  100

Table 4: Experimental parameters for the second 2-D experiment.


Automatically Obtained BTCs and BSC The numerous automatically obtained BTCs are quite
similar to those obtained in Experiment 2.1 and are not shown for reasons of brevity. The ten obtained
BSCs are shown in Figures 93-97. All ten are distinct, but they reflect the sets of ambiguities
inherent in the four CIDs, considered jointly (see Figure 76). For example, the classes 1, 2, 5, and
6 almost always appear together in an isolated part of the tree, which reflects the fact that either
three or four of these classes are ambiguous in all CIDs. None of the BSCs has a particularly
small ambiguity for the root node, reflecting again the fact that even the aggregate data (the four
CIDs considered jointly) is riddled with ambiguity.

Figure 93: BSCs obtained for Experiment 2.2 (1-2 of 10). Panels: BSC 1 (Wavelet: Beylkin, Param: 0) and BSC 2 (Wavelet: Coiflet, Param: 1).

Figure 94: BSCs obtained for Experiment 2.2 (3-4 of 10). Panels: BSC 3 (Wavelet: Coiflet, Param: 5) and BSC 4 (Wavelet: Daubechies, Param: 4).

Figure 95: BSCs obtained for Experiment 2.2 (5-6 of 10). Panels: BSC 5 (Wavelet: Daubechies, Param: 20) and BSC 6 (Wavelet: Symmlet, Param: 4).

Figure 96: BSCs obtained for Experiment 2.2 (7-8 of 10). Panels: BSC 7 (Wavelet: Symmlet, Param: 10) and BSC 8 (Wavelet: Vaidyanathan, Param: 0).

Figure 97: BSCs obtained for Experiment 2.2 (9-10 of 10). Panels: BSC 9 (Wavelet: Battle, Param: 1) and BSC 10 (Wavelet: Battle, Param: 3).

Probability of Correct Classification The estimated probabilities of correct classification for
Experiment 2.2 are shown in Figure 98. The forty BTCs are obtained as follows. The first ten
BTCs are the basic BTCs, one per wavelet in Table 4. The remaining thirty BTCs are derived from
these ten BTCs by keeping the structure, changing the CID, and retraining. The most obvious
result is that the classifier performance is not particularly sensitive to the wavelet choice. Six of the
BSCs achieve a Pcc greater than 0.98; the rest are near 0.9. The hypertree classifier outperforms
all forty of its constituent BTCs but is generally outperformed by the BSCs.

Figure 98: Probability of correct classification for Experiment 2.2.


Confusion  Matrices  and  Quality  Measures  The  confusion  matrices  and  histograms  of  decision 
quality  are  not  particularly  informative  and  are  not  shown. 

Conclusions  for  Experiment  2.2 

1.  The  basic  performance  ordering,  predicted  by  theory,  of  BSC  >  BHC  >  {BTC}  holds,  as  in 
Experiment  2.1. 

2. Classifier performance is not strongly sensitive to the wavelet choice, unlike the one-dimensional
experiment. This is likely due to the fact that the performance here is dominated by severe
ambiguities, which cannot be influenced by a better match between the classes and the
wavelet shape.

5.2.3 Experiment 2.3: Path Correction in the BTC

In  this  final  experiment,  we  revisit  Experiment  2.1  and  allow  the  BTCs  and  BSC  to  employ  the 
path-correction  algorithm.  Recall  that  this  optional  algorithm  greatly  improved  performance  in 
Experiment  1.  In  Experiment  2.3,  we  employ  the  structures  obtained  in  Experiment  2.1,  but  allow 
path  correction  to  take  place. 

Probability of Correct Classification The estimated probability of correct classification for Experiment
2.3 is shown in Figure 99. By comparing with Figure 86, we see that the performances
for the BTCs are slightly worse in Experiment 2.3, and the performances for the BHC and BSC
are unchanged. So, path correction does not help here. It might have helped with the BSC, but its
performance without path correction was already nearly perfect, so there are few paths to correct.

Figure 99: Probability of correct classification for Experiment 2.3.


Confusion  Matrices  The  confusion  matrices  for  Experiment  2.3  are  shown  in  Figures  100-102. 

Quality  Measures  The  histograms  of  the  output  quality  measure  for  Experiment  2.3  are  shown 
in  Figures  103-105.  As  in  Experiment  2.1,  there  are  many  incorrect  decisions  with  very  high 
quality. 

Conclusions  for  Experiment  2.3 

1.  Path  correction  does  not  help  a  BTC  (or  BSC)  when  the  problem  contains  severe  ambiguities. 
This  is  because  such  ambiguities  result  naturally  in  incorrect  decisions  with  high  quality. 
Even  when  the  classifier  attempts  to  correct  a  path,  it  will  likely  end  up  choosing  one  of  the 
decisions  in  the  equivalence  class  made  up  by  the  ambiguous  classes. 

Figure 100: Confusion matrices for BTCs 1 and 2 in Experiment 2.3.

Figure 101: Confusion matrices for BTCs 3 and 4 in Experiment 2.3.

Figure 102: Confusion matrices for the BHC and BSC in Experiment 2.3.

Figure 103: Quality histograms for BTCs 1 and 2 in Experiment 2.3.

Figure 104: Quality histograms for BTCs 3 and 4 in Experiment 2.3.

Figure 105: Quality histograms for the BHC and BSC in Experiment 2.3.

5.3 Collected-Data Problem: The StatLog Data Sets

This  section  presents  the  results  of  various  experiments  performed  by  applying  the  binary  tree 
classifiers  (BTCs)  and  the  binary  hypertree  classifier  (BHC)  to  publicly  available  data  sets  intended 
for  use  by  classifier-algorithm  developers. 

In particular, we present test results for the StatLog data sets [9]. The StatLog data sets are
publicly available data sets obtained from the Laboratory of Artificial Intelligence and Computer
Science (LIACC) at the University of Porto. These data sets were used in the “Project StatLog” of the
Machine Learning subgroup for evaluation and characterization of machine learning, neural, and
statistical classification algorithms. Some of the data sets in the StatLog database were originally
obtained from the larger database at the UCI Machine Learning Repository [ ]. The test results
of this section aid in the evaluation of the overall BTC and BHC performance and robustness.

5.3.1  Data  Set  Description 

There  are  a  total  of  ten  different  data  sets  in  the  StatLog  repository.  Four  of  these  data  sets  (DNA, 
Letter,  Shuttle,  Satimage),  which  already  have  the  training  and  test  data  sets  split,  are  used  to 
obtain  the  results  presented  herein.  Some  of  the  data  sets  are  claimed  to  have  been  “processed.” 
For  example,  for  the  DNA  sequence  set,  the  data  in  the  file  is  already  converted  to  numerical 
values  from  the  original  symbolic  variables  representing  the  nucleotides.  Table  5  tabulates  some 
parameters  pertaining  to  these  four  data  sets. 

A  brief  description  of  each  of  the  four  data  sets  follows. 

1.  DNA:  Each  entry  in  the  data  set  represents  a  DNA  sequence  with  splicing  boundaries  to 
be  classified.  The  original  60  symbolic  variables/attributes  in  the  sequence  representing  the 
nucleotides  have  been  converted  into  180  binary  indicator  variables.  Specifically,  the  four 
alphabet  symbols  (A,  C,  G,  T)  representing  the  nucleotides  are  mapped  into  the  four  binary 
sequences:  100,  010,  001,  and  000  respectively.  A  note  posted  on  the  repository  site  states 
that  much  better  performance  is  generally  observed  if  attributes  closest  to  the  junction  are 
used,  and  these  correspond  to  attribute  indices  61-120. 

2.  Letter:  The  original  black-and-white  pixel  displays  of  the  26  capital  letters  in  20  different 
fonts  randomly  distorted  are  converted  into  16  primitive  numerical  attributes  representing 
statistical  moments  and  edge  counts  and  then  scaled  to  the  range  of  integers  from  0  to  15  to 
form  this  data  set.  This  is  the  only  preprocessing  done  to  the  data  set. 

3.  Shuttle:  No  explicit  data  description  is  available  for  this  data  set.  The  class  labels  appear 
to  indicate  some  sort  of  shuttle  control  signals,  such  as  Rad  Flow,  Fpv  Close,  Fpv  Open, 
Bypass,  etc.  The  only  preprocessing  done  to  the  data  set  was  the  stripping  away  of  the 
time  ordering  information  by  randomizing  the  order  in  which  the  original  data  vectors  came. 
This  ordering  information  could  be  relevant  in  classification  that  takes  into  account  control 
sequencing  information. 

4. Satimage: This data set represents data from Landsat satellite images. Each sample represents
a 3-by-3 square image within an 82-by-100 pixel area that is contained in the original
image of 2340-by-3380 pixels. The images correspond to digital images of the same scene
in 4 different spectral bands. Class labels include red soil, cotton crop, grey soil, etc. Each
data vector comprises the values (ranging from 0 for black to 255 for white) of the 9 pixels
in 4 different spectral bands, resulting in 36 attributes. A note in the data description states
that to avoid the problem which arises when a 3-by-3 neighborhood straddles a boundary,
one can consider using only the 4 attributes 17-20, which correspond to the 4 spectral values
for the center pixel. No preprocessing was done to this data set.
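
The DNA conversion described in item 1 can be sketched as follows. This is only an illustration of the stated mapping (A, C, G, T to 100, 010, 001, 000) and of the suggested restriction to attributes 61-120. The function names and the example sequence are hypothetical, and the sketch assumes every symbol is one of the four nucleotides.

```python
# Mapping stated in the data set description: A, C, G, T -> 100, 010, 001, 000.
NUCLEOTIDE_BITS = {
    "A": (1, 0, 0),
    "C": (0, 1, 0),
    "G": (0, 0, 1),
    "T": (0, 0, 0),
}


def encode_dna(sequence):
    """Expand a string of nucleotides into a flat list of binary indicators."""
    bits = []
    for symbol in sequence:
        bits.extend(NUCLEOTIDE_BITS[symbol])
    return bits


def junction_attributes(attributes):
    """Keep attributes 61-120 (1-based indices), those closest to the junction."""
    return attributes[60:120]


# Hypothetical 60-symbol sequence, used only to show the resulting dimensions.
example = encode_dna("ACGT" * 15)
print(len(example), len(junction_attributes(example)))
```

With three indicators per symbol, the 60 original attributes become 180, and attributes 61-120 cover symbols 21 through 40 of the original sequence.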

5.3.2  Experimental  Set-Up 

A total of 16 experiments were performed on the StatLog data sets by varying a number of
parameters including the data set itself, the number of input vectors used to train the BTC, wavelet
types, data dimension (the number of attributes used), processed data dimension (dimension
after any zero padding), feature length K, and wavelet tree depth J. Relevant parameters for each
individual experiment are summarized in Tables 8-11. The parameters that are varied in the 16
experiments are tabulated in Table 6. In these tables, Ntr and Nte denote, respectively, the number
of training and test input vectors used in the experiments. In order to expedite processing, the first
Ntr data vectors in each class of the training data sets were used to train the BTC, and the resulting
classifier structures were then applied to the first Nte data vectors of the test data sets. It is noted,
however, that the data are not distributed uniformly over the classes. That is, the number of data
vectors available for each class is different. For the Shuttle data set, about 80% of the data belongs
to Class 1, and the class sizes range from 6 to 34108 for the training set and from 2 to 11478 for
the test set. For the remaining three data sets, there are at least 100 vectors per class. Therefore,
the number of training and test data vectors used in the experiment for the Shuttle data is chosen
to be < 100 (i.e., all vectors in a class are used when the class size is smaller than 100). For the
remaining data sets, the number of test data vectors used is chosen to be Nte = 100. The number
Ntr of training vectors used for each data set is shown in Table 6.
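
The selection rule described above (take the first Ntr or Nte vectors of each class, or all of them when a class is smaller) can be sketched as follows. This is a minimal illustration rather than the code used for the experiments. The function name is hypothetical, and applying the rule per class to the test set as well is an assumption.

```python
from collections import defaultdict


def first_n_per_class(vectors, labels, n):
    """Return the first n vectors of each class, preserving file order.

    Classes with fewer than n vectors contribute every vector they have.
    """
    kept = defaultdict(list)
    for vector, label in zip(vectors, labels):
        if len(kept[label]) < n:
            kept[label].append(vector)
    return dict(kept)


# Toy data: ten "vectors" (here just integers) spread unevenly over three classes.
toy_vectors = list(range(10))
toy_labels = [1, 1, 1, 2, 1, 2, 3, 1, 1, 2]
print(first_n_per_class(toy_vectors, toy_labels, 2))
```

In the toy data, class 3 has a single vector, so it contributes only that one, which mirrors the handling of the small Shuttle classes.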

To  further  expedite  processing,  only  the  first  8  classes  (letters  A-H)  were  used  for  the  Letter  data 
set  experiment  while  two  different  numbers  of  training  vectors  were  used  for  comparison  purposes 
in  this  experiment.  Three  different  wavelet  types  were  used  for  the  Letter  and  Satimage  data  sets 
in  order  to  observe  any  sensitivity  to  the  choice  of  wavelets.  Data  dimension  is  varied  for  only  the 
DNA  and  Satimage  data  sets  according  to  the  notes  given  in  Section  5.3.1  on  the  possible  effects 
of  the  choice  of  attributes  on  performance.  The  processed  data  dimension  is  simply  the  dimension 
of  the  data  vector  zero  padded  to  fulfill  a  dyadic  dimension  restriction.  Values  of  K  and  J  are  also 
varied to study their effects on performance. It is noted, however, that these 16 experiments are not
meant to be exhaustive. They simply serve the purpose of providing insight into the applicability
of the BTC in classifying inputs of various types.
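
A minimal sketch of the zero padding described above follows. It assumes that the dyadic restriction means the smallest power of two at or above the data dimension and that the zeros are appended at the end of the vector; the report states neither. This rule reproduces the processed dimensions listed in Table 6 for data dimensions 60, 180, 16, 9, and 36, but not for Experiment 13, where a dimension of 4 is listed as padded to 8, so the rule actually used evidently includes a further constraint that is not captured here. The function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def zero_pad_dyadic(vector):
    """Append zeros so that the vector length becomes a power of two."""
    target = next_power_of_two(len(vector))
    return list(vector) + [0] * (target - len(vector))


# Data dimensions taken from Table 6 (Experiment 13 is omitted, see the note above).
for dim in (60, 180, 16, 9, 36):
    print(dim, "->", len(zero_pad_dyadic([1] * dim)))
```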

5.3.3  Results  and  Discussion 

Results of the 16 experiments are shown in Figures 106-153. These correspond to 16 sets of three
figures, one set for each experiment. Results of experiments with path correction are also shown in
each figure for comparison purposes. The following results are presented in each set of three figures:

1. Quality measures as a function of trial indices. The trial indices are ordered in groups,
one for each class. The quality measure is the quality of the decision made by traversing
the classification tree. This quality is a function of the set of correlation coefficient pairs
encountered during the tree traversal. Also indicated are the probability of error Pe and
probability of correct classification Pcc over all classes and all trials.

2.  Confusion  matrices  that  show  the  basic  misclassification  patterns. 

3.  Histograms  of  the  quality  measures  corresponding  to  trials  for  which  the  classification  is 
correct  (indicated  by  red  bars)  and  trials  for  which  the  classification  is  incorrect  (indicated 
by  blue  bars). 

Table 7 summarizes the individual (single class) and overall classification performance in terms of
Pcc.
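
The quantities reported in the figures and in Table 7 (Pe, Pcc, and the smallest and largest single-class Pcc) can be computed from a list of true classes and decisions as sketched below. This is an illustration only. The function names are hypothetical, and the convention that rows index the true class and columns index the decision is an assumption.

```python
def confusion_matrix(true_labels, decisions, classes):
    """Count decisions per true class: rows are true classes, columns are decisions."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, d in zip(true_labels, decisions):
        matrix[index[t]][index[d]] += 1
    return matrix


def summarize(matrix):
    """Overall Pcc and Pe plus the smallest and largest per-class Pcc."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    # Per-class Pcc is the diagonal count divided by the row total (empty rows skipped).
    per_class = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    pcc = correct / total
    return {"Pcc": pcc, "Pe": 1 - pcc,
            "PCCmin": min(per_class), "PCCmax": max(per_class)}


# Toy example with two classes and eight trials.
toy_true = [1, 1, 1, 1, 2, 2, 2, 2]
toy_decisions = [1, 1, 1, 2, 2, 2, 1, 1]
toy_matrix = confusion_matrix(toy_true, toy_decisions, classes=[1, 2])
print(toy_matrix, summarize(toy_matrix))
```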

Based  on  the  figures,  there  does  not  seem  to  be  noticeable  overall  improvement  with  the  use  of 
path  correction  in  these  data  sets.  While  improvement  is  observed  in  the  results  for  certain  classes 
and  data  sets,  degradation  is  observed  in  others.  This  is  possibly  due  to  the  nature  of  the  data 
such  as  interclass  ambiguity.  The  effect  of  path  correction  is  therefore  not  conclusive  from  these 
experiments. 

For  the  DNA  data  set,  an  overall  performance  degradation  is  observed  in  Experiment  2  with  the 
use  of  the  full  data  dimension  versus  the  use  of  only  attributes  61-120  in  Experiment  1,  which  is 
claimed  to  yield  better  results  (cf.  DNA  data  set  description  in  Section  5.3.1).  However,  it  is  noted 
that  while  performance  is  worsened  for  Classes  1  and  3,  performance  for  Class  2  is  improved. 

For the Letter data set, the use of a larger number of training vectors (500 instead of 100) does
not appear to improve performance noticeably between Experiments 3 and 4 for which K = 8. But
for K = 16 (Experiments 5-10), the use of a larger number of training vectors appears to improve
overall performance. Increasing K from 8 to 16 also appears to improve general performance
slightly. It is interesting to note that with the larger K, performance for Classes 1-4 is improved
while performance for Classes 5-8 is worse. It is also noted that the low-quality measures evident
in the results for K = 8 are eliminated in the results for K = 16. Lastly, results of Experiments
5-10 for the Letter data set also demonstrate insensitivity to the particular choices of wavelets used.

For the Shuttle data set, the increase in K does not appear to affect performance significantly.
Performance for certain classes is improved while that for others is worsened.

Finally,  for  the  Satimage  data  set,  a  slight  performance  degradation  is  observed  in  Experiment 
14  with  the  use  of  the  full  data  dimension  versus  the  use  of  only  attributes  17-20  in  Experiment  13 
(cf.  Satimage  data  set  description  in  Section  5.3.1).  Also  noticed  is  a  small  degree  of  sensitivity  to 
the  choice  of  wavelet  types. 
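
The attribute range 17-20 used in Experiment 13 follows from the data layout if the 36 attributes are stored pixel by pixel with the 4 spectral values of each pixel adjacent. That ordering is an assumption here, although it is consistent with the note in the data description. A small sketch with hypothetical function names:

```python
BANDS = 4   # spectral bands per pixel
PIXELS = 9  # pixels in the 3-by-3 neighborhood


def pixel_attribute_indices(pixel, bands=BANDS):
    """1-based attribute indices holding the band values of a 1-based pixel position."""
    first = (pixel - 1) * bands + 1
    return list(range(first, first + bands))


def center_pixel_values(vector):
    """Extract the band values of the center pixel from a 36-attribute vector."""
    center = (PIXELS + 1) // 2  # pixel 5 of 9
    return [vector[i - 1] for i in pixel_attribute_indices(center)]


print(pixel_attribute_indices(5))
```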

The experimental results presented here demonstrate that the BTC has the capability of
classifying suitable publicly available data sets. While performance for certain classes and data sets is
very good, performance for some others is quite bad. This can possibly be attributed to the nature
of the data sets, such as interclass ambiguity. The below-average performance for the Letter data
set could possibly be due to the preprocessing that was done to the original letter image data, to
which we have not found access. We believe performance would improve if the LDB-based classifier
were applied directly to these original images rather than to the numerical attributes of edge counts
and statistical moments represented in these data.

| Data Set | Size of Training Set | Size of Test Set | Number of Classes | Data Dimension | Integer Data Values |
|----------|----------------------|------------------|-------------------|----------------|---------------------|
| DNA      | 2000                 | 1186             | 3                 | 180            | {0, 1}              |
| Letter   | 15000                | 5000             | 26                | 16             | [0, 15]             |
| Shuttle  | 43500                | 14500            | 7                 | 9              | [-26739, 15164]     |
| Satimage | 4435                 | 2000             | 6                 | 36             | [0, 255]            |

Table  5:  Some  parameters  pertaining  to  four  of  the  data  sets  obtained  from  the  StatLog  repository. 


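| Experiment | Data Set | Ntr   | Wavelet       | Data Dimension | Processed Data Dimension | K  | J |
|------------|----------|-------|---------------|----------------|--------------------------|----|---|
| 1          | DNA      | 400   | Coiflet(5)    | 60             | 64                       | 16 | 3 |
| 2          | DNA      | 400   | Coiflet(5)    | 180            | 256                      | 32 | 8 |
| 3          | Letter   | 100   | Coiflet(5)    | 16             | 16                       | 8  | 4 |
| 4          | Letter   | 500   | Coiflet(5)    | 16             | 16                       | 8  | 4 |
| 5          | Letter   | 100   | Coiflet(5)    | 16             | 16                       | 16 | 4 |
| 6          | Letter   | 100   | Beylkin(0)    | 16             | 16                       | 16 | 4 |
| 7          | Letter   | 100   | Daubechies(4) | 16             | 16                       | 16 | 4 |
| 8          | Letter   | 500   | Coiflet(5)    | 16             | 16                       | 16 | 4 |
| 9          | Letter   | 500   | Beylkin(0)    | 16             | 16                       | 16 | 4 |
| 10         | Letter   | 500   | Daubechies(4) | 16             | 16                       | 16 | 4 |
| 11         | Shuttle  | < 100 | Coiflet(5)    | 9              | 16                       | 4  | 4 |
| 12         | Shuttle  | < 100 | Coiflet(5)    | 9              | 16                       | 9  | 4 |
| 13         | Satimage | 400   | Coiflet(5)    | 4              | 8                        | 4  | 4 |
| 14         | Satimage | 400   | Coiflet(5)    | 36             | 64                       | 36 | 6 |
| 15         | Satimage | 400   | Beylkin(0)    | 36             | 64                       | 36 | 6 |
| 16         | Satimage | 400   | Daubechies(4) | 36             | 64                       | 36 | 6 |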

Table  6:  Summary  of  varied  parameters  in  the  StatLog  data  experiments. 

| Exp | Data Set | Ntr   | Wavelet       | Data Dimension | K  | J | PCCmin | PCCmax | Pcc  |
|-----|----------|-------|---------------|----------------|----|---|--------|--------|------|
| 1   | DNA      | 400   | Coiflet(5)    | 60             | 16 | 3 | 0.71   | 0.95   | 0.82 |
| 2   | DNA      | 400   | Coiflet(5)    | 180            | 32 | 8 | 0.56   | 0.74   | 0.62 |
| 3   | Letter   | 100   | Coiflet(5)    | 16             | 8  | 4 | 0.33   | 0.93   | 0.51 |
| 4   | Letter   | 500   | Coiflet(5)    | 16             | 8  | 4 | 0.26   | 0.93   | 0.51 |
| 5   | Letter   | 100   | Coiflet(5)    | 16             | 16 | 4 | 0.18   | 0.90   | 0.53 |
| 6   | Letter   | 100   | Beylkin(0)    | 16             | 16 | 4 | 0.18   | 0.90   | 0.53 |
| 7   | Letter   | 100   | Daubechies(4) | 16             | 16 | 4 | 0.18   | 0.90   | 0.53 |
| 8   | Letter   | 500   | Coiflet(5)    | 16             | 16 | 4 | 0.27   | 0.90   | 0.58 |
| 9   | Letter   | 500   | Beylkin(0)    | 16             | 16 | 4 | 0.27   | 0.90   | 0.58 |
| 10  | Letter   | 500   | Daubechies(4) | 16             | 16 | 4 | 0.27   | 0.90   | 0.58 |
| 11  | Shuttle  | < 100 | Coiflet(5)    | 9              | 4  | 4 | 0.28   | 1.00   | 0.69 |
| 12  | Shuttle  | < 100 | Coiflet(5)    | 9              | 9  | 4 | 0.22   | 1.00   | 0.70 |
| 13  | Satimage | 400   | Coiflet(5)    | 4              | 4  | 4 | 0.29   | 0.98   | 0.71 |
| 14  | Satimage | 400   | Coiflet(5)    | 36             | 36 | 6 | 0.32   | 1.00   | 0.68 |
| 15  | Satimage | 400   | Beylkin(0)    | 36             | 36 | 6 | 0.28   | 1.00   | 0.71 |
| 16  | Satimage | 400   | Daubechies(4) | 36             | 36 | 6 | 0.33   | 0.96   | 0.65 |

Table 7: Performance summary for the case with no path correction. PCCmin represents the
probability of correct classification for the class that is classified correctly the least among the classes in
the set and PCCmax represents the probability of correct classification for the class that is classified
correctly the most. Pcc is the overall probability of correct classification.

| Parameter                 | Exp 1   | Exp 2   |
|---------------------------|---------|---------|
| Wavelet Type              | Coiflet | Coiflet |
| Wavelet Parameter         | 5       | 5       |
| Feature Length K          | 16      | 32      |
| Number of Classes C       | 3       | 3       |
| BTC Wavelet Tree Depth J  | 3       | 8       |
| Data Dimension            | [1 60]  | [1 180] |
| Processed Data Dimension  | [1 64]  | [1 256] |
| Ntr                       | 400     | 400     |
| Nte                       | 100     | 100     |

Table  8:  Parameters  for  Experiments  1  and  2:  DNA  data  set. 


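| Parameter             | Exp 3   | Exp 4   | Exp 5   | Exp 6   | Exp 7 | Exp 8   | Exp 9   | Exp 10 |
|-----------------------|---------|---------|---------|---------|-------|---------|---------|--------|
| Wavelet Type          | Coiflet | Coiflet | Coiflet | Beylkin | Daub. | Coiflet | Beylkin | Daub.  |
| Wavelet Parameter     | 5       | 5       | 5       | 0       | 4     | 5       | 0       | 4      |
| Feature Length K      | 8       | 8       | 16      | 16      | 16    | 16      | 16      | 16     |
| Number of Classes C   | 8       | 8       | 8       | 8       | 8     | 8       | 8       | 8      |
| Wavelet Tree Depth J  | 4       | 4       | 4       | 4       | 4     | 4       | 4       | 4      |
| Ntr                   | 100     | 500     | 100     | 100     | 100   | 500     | 500     | 500    |
| Nte                   | 100     | 100     | 100     | 100     | 100   | 100     | 100     | 100    |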

Table  9:  Experimental  parameters  for  Experiments  3-10:  Letter  data  set.  The  data  dimension  and 
processed  data  dimensions  are  [1  16]. 

| Parameter                 | Exp 11  | Exp 12  |
|---------------------------|---------|---------|
| Wavelet Type              | Coiflet | Coiflet |
| Wavelet Parameter         | 5       | 5       |
| Feature Length K          | 4       | 9       |
| Number of Classes C       | 7       | 7       |
| BTC Wavelet Tree Depth J  | 4       | 4       |
| Data Dimension            | [1 9]   | [1 9]   |
| Processed Data Dimension  | [1 16]  | [1 16]  |
| Ntr                       | < 100   | < 100   |
| Nte                       | < 100   | < 100   |

Table  10:  Experimental  parameters  for  Experiments  11  and  12:  Shuttle  data  set. 


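| Parameter                 | Exp 13  | Exp 14  | Exp 15  | Exp 16  |
|---------------------------|---------|---------|---------|---------|
| Wavelet Type              | Coiflet | Coiflet | Beylkin | Daub.   |
| Wavelet Parameter         | 5       | 5       | 0       | 4       |
| Feature Length K          | 4       | 36      | 36      | 36      |
| Number of Classes C       | 6       | 6       | 6       | 6       |
| BTC Wavelet Tree Depth J  | 4       | 6       | 6       | 6       |
| Data Dimension            | [1 4]   | [1 36]  | [1 36]  | [1 36]  |
| Processed Data Dimension  | [1 8]   | [1 64]  | [1 64]  | [1 64]  |
| Ntr                       | 400     | 400     | 400     | 400     |
| Nte                       | 100     | 100     | 100     | 100     |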

Table  11:  Experimental  parameters  for  Experiments  13-16:  Satimage  data  set. 

[Figures 106-153 are not reproduced here; only their captions are listed. Each figure has two
panels, "Without Path Correction" and "With Path Correction." Recoverable axis labels include
Trial Index, Decision, True Class, and Quality Measure.]

Figure 106: Quality measures for Experiment 1-DNA data set.
Figure 107: Confusion matrix for Experiment 1-DNA data set.
Figure 108: Histogram of quality measures for Experiment 1-DNA data set.

Figure 109: Quality measures for Experiment 2-DNA data set.
Figure 110: Confusion matrix for Experiment 2-DNA data set.
Figure 111: Histogram of quality measures for Experiment 2-DNA data set.

Figure 112: Quality measures for Experiment 3-Letter data set.
Figure 113: Confusion matrix for Experiment 3-Letter data set.
Figure 114: Histogram of quality measures for Experiment 3-Letter data set.

Figure 115: Quality measures for Experiment 4-Letter data set.
Figure 116: Confusion matrix for Experiment 4-Letter data set.
Figure 117: Histogram of quality measures for Experiment 4-Letter data set.

Figure 118: Quality measures for Experiment 5-Letter data set.
Figure 119: Confusion matrix for Experiment 5-Letter data set.
Figure 120: Histogram of quality measures for Experiment 5-Letter data set.

Figure 121: Quality measures for Experiment 6-Letter data set.
Figure 122: Confusion matrix for Experiment 6-Letter data set.
Figure 123: Histogram of quality measures for Experiment 6-Letter data set.

Figure 124: Quality measures for Experiment 7-Letter data set.
Figure 125: Confusion matrix for Experiment 7-Letter data set.
Figure 126: Histogram of quality measures for Experiment 7-Letter data set.

Figure 127: Quality measures for Experiment 8-Letter data set.
Figure 128: Confusion matrix for Experiment 8-Letter data set.
Figure 129: Histogram of quality measures for Experiment 8-Letter data set.

Figure 130: Quality measures for Experiment 9-Letter data set. (Legend in both panels:
Qxy (Pe = 0.42375), Qxx (Pcc = 0.57625).)
Figure 131: Confusion matrix for Experiment 9-Letter data set.
Figure 132: Histogram of quality measures for Experiment 9-Letter data set.

Figure 133: Quality measures for Experiment 10-Letter data set.
Figure 134: Confusion matrix for Experiment 10-Letter data set.
Figure 135: Histogram of quality measures for Experiment 10-Letter data set.

Figure 136: Quality measures for Experiment 11-Shuttle data set.
Figure 137: Confusion matrix for Experiment 11-Shuttle data set.
Figure 138: Histogram of quality measures for Experiment 11-Shuttle data set.

Figure 139: Quality measures for Experiment 12-Shuttle data set.
Figure 140: Confusion matrix for Experiment 12-Shuttle data set.
Figure 141: Histogram of quality measures for Experiment 12-Shuttle data set.

Figure 142: Quality measures for Experiment 13-Satimage data set.
Figure 143: Confusion matrix for Experiment 13-Satimage data set.
Figure 144: Histogram of quality measures for Experiment 13-Satimage data set.

Figure 145: Quality measures for Experiment 14-Satimage data set.
Figure 146: Confusion matrix for Experiment 14-Satimage data set.
Figure 147: Histogram of quality measures for Experiment 14-Satimage data set.

Figure 148: Quality measures for Experiment 15-Satimage data set.
Figure 149: Confusion matrix for Experiment 15-Satimage data set.
Figure 150: Histogram of quality measures for Experiment 15-Satimage data set.

Figure 151: Quality measures for Experiment 16-Satimage data set.
Figure 152: Confusion matrix for Experiment 16-Satimage data set.
Figure 153: Histogram of quality measures for Experiment 16-Satimage data set.

6 Conclusions

In  this  final  section,  we  provide  a  set  of  concise  conclusions  that  are  based  on  the  experimental 
study  of  this  report. 

1. The tree-based classifiers developed in [  ], including the binary-tree classifier (BTC),
binary hypertree classifier (BHC), and binary supertree classifier (BSC), were studied using
synthetic and collected data sets. The basic operation of the classifiers was validated.

2. The algorithm for joint automatic blind specification of the tree’s topology, superclass
specifications, and feature-vector parameters results in classifiers that are adept at embodying the
basic ambiguity structure of the problem at hand. This was demonstrated with synthetic and
collected one- and two-dimensional data sets.

3. The basic performance ordering of {BTC} < BHC < BSC was confirmed for the synthetic
two-dimensional problem, which contained severe class ambiguities.

4. The basic performance ordering was not confirmed for the synthetic one-dimensional
problem (involving automatic recognition of each of sixteen maximal-length shift-register
sequences). In this case, an individual BTC could have better performance than the BHC. We
conjecture that this contradiction of our mathematical analysis in [1] arises from the breaking
of our fundamental assumption that the probability of error at a decision node is a smooth
function of the node’s ambiguity.

5. The notion of path correction was introduced and was shown to dramatically improve
performance for problems involving weak class ambiguities and to have little effect (positive or
negative) for problems possessing severe class ambiguities.

6. Further work should focus on refinement of the construction of a BHC from constituent
BTCs. In particular, the algorithm needs to make better joint use of node ambiguity, node
feature-vector strength, and local tree topology in order to improve the selection of nodes to
be included as hypertree jump points.